I want to work with PLINK binary data files (.bed, .bim, .fam) in Python. The pyplink module does this.
Install
pyplink can be installed with pip:
pip install pyplink
How to use
Assume that the files foo.bed, foo.bim, and foo.fam are in the current directory. Pass their shared prefix to PyPlink:
from pyplink import PyPlink
pyp = PyPlink("foo")
Now we have an object called pyp, which combines the .bed, .fam, and .bim files. Its methods give access to the contents of each file.
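pyp.get_fam()         # sample table read from the .fam file
pyp.get_nb_samples()  # number of samples
pyp.get_bim()         # marker table read from the .bim file
pyp.get_nb_markers()  # number of markers
markerNames = pyp.get_bim().index  # marker names are the index of the marker table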
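Opening a prefix assumes that all three files of the fileset exist. The following is a minimal sketch of a check that could be run before calling PyPlink("foo"). The helper name missing_plink_files is hypothetical and not part of pyplink; it only uses the standard library.

```python
from pathlib import Path

# The three extensions that make up a binary PLINK fileset.
PLINK_EXTENSIONS = (".bed", ".bim", ".fam")


def missing_plink_files(prefix):
    """Return the paths of the fileset under `prefix` that are not on disk.

    Hypothetical helper for illustration; it is not part of pyplink.
    """
    return [
        prefix + ext
        for ext in PLINK_EXTENSIONS
        if not Path(prefix + ext).is_file()
    ]


missing = missing_plink_files("foo")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All files found for prefix foo")
```

An empty list means the prefix is complete and can be handed to PyPlink.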
Pass a marker name to get the genotypes of that marker.
The acgt variant returns the genotypes as nucleotide letters instead of numeric codes.
pyp.get_geno_marker(markerNames[0])
pyp.get_acgt_geno_marker(markerNames[0])
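The relationship between the two calls can be illustrated without any data files. The sketch below assumes the encoding described in the pyplink documentation: each numeric genotype is the number of copies of the a1 allele (0, 1 or 2), and -1 marks a missing call. The function additive_to_acgt is a hypothetical stand-in, and rendering a missing call as "00" is also an assumption about the library's output.

```python
def additive_to_acgt(genotypes, a1, a2):
    """Map additive codes (copies of a1, -1 = missing) to two-letter genotypes.

    Hypothetical illustration of the assumed encoding, not pyplink code.
    """
    lookup = {
        0: a2 + a2,   # no copy of a1: homozygous a2
        1: a1 + a2,   # one copy of a1: heterozygous
        2: a1 + a1,   # two copies of a1: homozygous a1
        -1: "00",     # missing call (assumed representation)
    }
    return [lookup[g] for g in genotypes]


print(additive_to_acgt([0, 1, 2, -1], a1="A", a2="G"))
# ['GG', 'AG', 'AA', '00']
```

The a1 and a2 alleles of a real marker are in the a1 and a2 columns of the table returned by pyp.get_bim().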
It is also possible to obtain the marker ID and genotype as an iterator.
markers = ["rs7092431", "rs9943770", "rs1578483"]
for marker_id, genotypes in pyp.iter_geno_marker(markers):
    print(marker_id)
    print(genotypes, end="\n\n")
Sample script
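Get the genotypes of the male samples for all markers on chromosome 23
all_samples = pyp.get_fam()
all_markers = pyp.get_bim()
# Assumed definitions: markers whose chrom value is 23, samples whose gender code is 1 (male)
y_markers = all_markers[all_markers.chrom == 23].index.values
males = (all_samples.gender == 1).values
for marker_ID, genotypes in pyp.iter_geno_marker(y_markers):
    male_genotypes = genotypes[males]
    print("{:d} total genotypes".format(len(genotypes)))
    print("{:d} genotypes for {:,d} males ({} on chr{} and position {:,d})".format(
        len(male_genotypes),
        males.sum(),
        marker_ID,
        all_markers.loc[marker_ID, "chrom"],
        all_markers.loc[marker_ID, "pos"],
    ))
    break  # show only the first marker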
Get the minor allele frequency and genotypes of the specified markers
all_samples = pyp.get_fam()
# Founders are samples with no recorded father or mother
founders = (all_samples.father == "0") & (all_samples.mother == "0")
markers = ["rs7092431", "rs9943770", "rs1587483"]
for marker_ID, genotypes in pyp.iter_geno_marker(markers):
    # Keep founders with a non-missing call (-1 marks a missing genotype)
    valid_genotypes = genotypes[founders.values & (genotypes != -1)]
    maf = valid_genotypes.sum()/(len(valid_genotypes)*2)
    print(marker_ID, round(maf, 6), sep="\t")
    print(genotypes)
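The frequency calculation can be checked by hand on a small made-up example. The sketch below uses plain Python lists instead of the arrays returned by pyplink, and assumes the same encoding as the script: each genotype counts copies of one allele, and -1 is a missing call. The function name and the toy data are hypothetical.

```python
def allele_frequency(genotypes, founders):
    """Frequency of the counted allele among founders with a non-missing call.

    Hypothetical helper mirroring the sum / (2 * n) calculation on plain lists.
    """
    valid = [
        g for g, is_founder in zip(genotypes, founders)
        if is_founder and g != -1
    ]
    if not valid:
        return float("nan")  # no usable genotype for this marker
    # Each sample carries two alleles, hence the factor of 2.
    return sum(valid) / (2 * len(valid))


# Toy data: six samples, one missing call, one non-founder.
genotypes = [0, 1, 2, -1, 1, 0]
founders = [True, True, True, True, False, True]
print(allele_frequency(genotypes, founders))
# 0.375
```

Four samples remain after filtering (codes 0, 1, 2, 0), so the result is 3 / 8. This value is the frequency of the counted allele; it equals the minor allele frequency only when that allele is the rarer of the two.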