PLINK in Python


I would like to use PLINK binary data files (.bed, .bim, .fam) in Python. There is a module for this called pyplink.

Install

It can be installed with pip:

pip install pyplink

How to use

Assume the current directory contains a set of files that share a prefix, such as foo.bed, foo.bim, and foo.fam. Pass that prefix to PyPlink:

from pyplink import PyPlink
pyp = PyPlink("foo")

Now we have an object called pyp that ties the .bed, .bim, and .fam files together. Its methods give access to the contents of each file:

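pyp.get_fam()          # sample information from the .fam file (pandas DataFrame)
pyp.get_nb_samples()   # number of samples
pyp.get_bim()          # marker information from the .bim file (pandas DataFrame)
pyp.get_nb_markers()   # number of markers
# Marker names are the index of the .bim DataFrame, not one of its columns
markerNames = pyp.get_bim().index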
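
To make the sample table more concrete, here is a minimal sketch that parses .fam text by hand, without pyplink. It assumes the six standard .fam fields and the column names fid, iid, father, mother, gender, and status, which is my understanding of what get_fam() uses. The sample rows and the parse_fam helper are made up for illustration.

```python
# Column names assumed for the six whitespace-separated .fam fields.
FAM_COLUMNS = ("fid", "iid", "father", "mother", "gender", "status")

# Toy .fam content (made-up samples, not real data).
TOY_FAM = """\
F1 S1 0 0 1 -9
F1 S2 0 0 2 -9
F1 S3 S1 S2 1 -9
"""


def parse_fam(text):
    """Return one dict per sample, keyed by FAM_COLUMNS (hypothetical helper)."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(FAM_COLUMNS):
            continue  # skip blank or malformed lines
        samples.append(dict(zip(FAM_COLUMNS, fields)))
    return samples


samples = parse_fam(TOY_FAM)

# In a .fam file, gender 1 means male and a parent ID of 0 means "unknown".
males = [s["iid"] for s in samples if s["gender"] == "1"]
founders = [s["iid"] for s in samples
            if s["father"] == "0" and s["mother"] == "0"]

print(len(samples), males, founders)
```

The same two filters (males, founders) appear as boolean masks on the DataFrame in the sample scripts further down.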

To obtain the genotypes of a marker, pass its name to get_geno_marker. The acgt variant returns the genotypes as base letters instead of numeric codes.

pyp.get_geno_marker(markerNames[0])
pyp.get_acgt_geno_marker(markerNames[0])
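
The relation between the two return values can be sketched in plain Python. This assumes, as I understand pyplink's encoding, that the numeric genotype counts copies of the a1 allele (0, 1, or 2) and that -1 marks a missing call, shown as "00" in the ACGT form. The to_acgt helper and the toy data are hypothetical and are not part of pyplink.

```python
def to_acgt(additive_genotypes, a1, a2):
    """Translate additive codes into two-letter genotypes (illustrative only)."""
    mapping = {
        2: a1 + a1,   # two copies of a1
        1: a1 + a2,   # heterozygous
        0: a2 + a2,   # no copy of a1
        -1: "00",     # missing genotype (assumed placeholder)
    }
    return [mapping[g] for g in additive_genotypes]


# Toy genotypes for a marker with a1 = "A" and a2 = "G" (made-up values).
toy_genotypes = [0, 1, 2, -1]
acgt = to_acgt(toy_genotypes, a1="A", a2="G")
print(acgt)
```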

It is also possible to obtain the marker ID and genotype as an iterator.

markers = ["rs7092431", "rs9943770", "rs1578483"]
for marker_id, genotypes in pyp.iter_geno_marker(markers):
    print(marker_id)
    print(genotypes, end="\n\n")

Sample script

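For all markers on the Y chromosome (coded as chromosome 24 in PLINK), obtain the genotypes of the male samples.

all_markers = pyp.get_bim()
all_samples = pyp.get_fam()
# Names of the markers on the Y chromosome
y_markers = all_markers[all_markers.chrom == 24].index.values
# Boolean mask of male samples (gender 1 in the .fam file)
males = all_samples.gender == 1
for marker_ID, genotypes in pyp.iter_geno_marker(y_markers):
    male_genotypes = genotypes[males.values]
    print("{:d} total genotypes".format(len(genotypes)))
    print("{:d} genotypes for {:,d} males ({} on chr{} and position {:,d})".format(
        len(male_genotypes),
        males.sum(),
        marker_ID,
        all_markers.loc[marker_ID, "chrom"],
        all_markers.loc[marker_ID, "pos"],
    ))
    break  # only show the first marker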

Get the minor allele frequency and the genotypes of the specified markers

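all_samples = pyp.get_fam()
# Founders are samples with no parents recorded in the .fam file
founders = (all_samples.father == "0") & (all_samples.mother == "0")
markers = ["rs7092431", "rs9943770", "rs1587483"]
for marker_ID, genotypes in pyp.iter_geno_marker(markers):
    # Keep founders only and drop missing genotypes (-1)
    valid_genotypes = genotypes[founders.values & (genotypes != -1)]
    # Genotypes count copies of a1, which PLINK normally sets to the minor allele
    maf = valid_genotypes.sum() / (len(valid_genotypes) * 2)
    print(marker_ID, round(maf, 6), sep="\t")
    print(genotypes)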
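
The frequency calculation is easy to check by hand on a small example. Below is a plain-Python sketch of the same arithmetic with made-up genotypes. The allele_frequency helper is hypothetical, and it assumes the additive coding used above (copies of a1, with -1 for missing).

```python
def allele_frequency(genotypes, is_founder):
    """Frequency of the counted allele among founders with a called genotype."""
    valid = [g for g, f in zip(genotypes, is_founder) if f and g != -1]
    if not valid:
        return None  # no usable genotype, so the frequency is undefined
    # Each sample carries two alleles, hence the factor of 2.
    return sum(valid) / (2 * len(valid))


# Toy data: six samples, one non-founder and one missing genotype.
toy_genotypes = [0, 1, 2, -1, 1, 0]
toy_founders = [True, True, False, True, True, True]

# Four samples remain (0, 1, 1, 0): 2 counted alleles out of 8.
freq = allele_frequency(toy_genotypes, toy_founders)
print(freq)
```

Returning None when no genotype is usable avoids the division by zero that the one-line formula would hit for a marker with no called founders.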