pandas-plink

Chromosome names for X, Y and MT?

Open dbolser opened this issue 6 months ago • 0 comments

Sorry if I'm doing something wrong, but when I use plink2 ... --recode vcf, I get chromosomes called 21, 22, X, Y and even MT. However, with read_plink(files), they are encoded as 21, 22, 23, 24 and 25.

I know this encoding is expected: https://www.cog-genomics.org/plink/1.9/input

Given diploid autosomes, the remaining modifiers let you indicate the absence of specific non-autosomal chromosomes, as an extra sanity check on the input data. Note that, when there are n autosome pairs, the X chromosome is assigned numeric code n+1, Y is n+2, XY (pseudo-autosomal region of X) is n+3, and MT (mitochondria) is n+4.

However, is there a way to 'fix it' in the output, like --recode vcf does?

I don't see anything in the documentation about this...

I'm currently writing files out as:

...
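# Find the SNVs: keep variants where both alleles are a single base
p = bim.a0.str.len() == 1
q = bim.a1.str.len() == 1
snv = bim[p & q]

print("SNVs:", snv.shape)

# Write the selected columns as a tab-separated file, without the index
snv.to_csv("sensible_name.tsv", sep="\t", columns=["chrom", "pos", "snp", "a0", "a1"], index=False)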

So I'm trying to avoid going in and editing the DataFrame line by line...
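One possible workaround, as a minimal sketch rather than anything pandas-plink provides: remap the whole chrom column in one vectorized step before writing. The mapping below follows the PLINK text quoted above for human data (n = 22 autosome pairs, so X = 23, Y = 24, XY = 25, MT = 26). The names PLINK_HUMAN_CODES and rename_chromosomes are hypothetical, and the small DataFrame is a made-up stand-in for the real bim. It also assumes the chrom values can be treated as strings, so the column is cast with astype(str) first.

```python
import pandas as pd

# Hypothetical stand-in for the `bim` DataFrame returned by read_plink.
bim = pd.DataFrame({
    "chrom": ["21", "22", "23", "24", "26"],
    "snp": ["rs1", "rs2", "rs3", "rs4", "rs5"],
    "pos": [100, 200, 300, 400, 500],
    "a0": ["A", "C", "G", "T", "A"],
    "a1": ["G", "T", "A", "C", "AT"],
})

# Assumed mapping for human data (22 autosome pairs), per the PLINK 1.9 docs.
PLINK_HUMAN_CODES = {"23": "X", "24": "Y", "25": "XY", "26": "MT"}


def rename_chromosomes(chrom, codes=PLINK_HUMAN_CODES):
    """Return a copy of `chrom` with numeric sex/MT codes replaced by names."""
    chrom = chrom.astype(str)  # also handles categorical or integer columns
    # map() yields NaN for codes not in the dict; fall back to the original.
    return chrom.map(codes).fillna(chrom)


bim["chrom"] = rename_chromosomes(bim["chrom"])
print(bim["chrom"].tolist())  # ['21', '22', 'X', 'Y', 'MT']
```

The SNV filter and to_csv call from the post should then work unchanged on the renamed column. For a non-human organism the codes would need adjusting to that species' autosome count.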

dbolser, Dec 06 '23 17:12