pandas-plink icon indicating copy to clipboard operation
pandas-plink copied to clipboard

Genotype code wrong!

Open liqingbioinfo opened this issue 3 years ago • 1 comments

Hi there,

 I was surprised to find a major error in your pandas-plink package! I used it to read /bed/bim/fam files into python. But I found that my minor alleles homozygous were coded as 0 and major alleles homozygous were coded as 2!!! That invests all my results! Can you please check your source code to correct such an error! Otherwise, it's really dangerous for others to continue using this pandas-plink package!!!

image

liqingbioinfo avatar Feb 26 '22 00:02 liqingbioinfo

Hi @liqingbioinfo also found this. You can however fix it using ref="a0":

from os.path import join
from pandas_plink import read_plink1_bin
from pandas_plink import get_data_folder

G = read_plink1_bin(join(get_data_folder(), "chr*.bed"), verbose=False, ref="a0")
G[0:5, 0:5].compute() # the first 5x5

Which should get you:

array([[2., 2., 0., 2., 2.],
       [2., 1., 0., 2., 2.],
       [2., 2., 0., 1., 2.],
       [2., 2., 0., 2., 2.],
       [2., 2., 0., 2., 2.]], dtype=float32)

As opposed to using ref="a1" (which is the default):

array([[0., 0., 2., 0., 0.],
       [0., 1., 2., 0., 0.],
       [0., 0., 2., 1., 0.],
       [0., 0., 2., 0., 0.],
       [0., 0., 2., 0., 0.]], dtype=float32)

This is also stated in the docs:

ref (str) – Reference allele. Specify which allele the dosage matrix will count. It can be either "a1" (default) or "a0".

KennethEnevoldsen avatar Apr 11 '22 14:04 KennethEnevoldsen

Is a1 the sensible default? If so, I guess you can close this.

dbolser avatar Jun 19 '24 08:06 dbolser

I will have a look at it later today.

horta avatar Jun 19 '24 14:06 horta

If I’m not mistaken, I chose not to select a reference allele based on its frequency to (i) avoid having to read the entire dataset to decide which allele to use, and (ii) avoid arbitrarily determining a reference allele when the frequencies are identical. Instead, I use a1 by default and allow the user to select a different one if they wish.

horta avatar Jun 19 '24 20:06 horta

It would be great if there is a way to avoid the above confusion. Let me know if there is!

horta avatar Jun 19 '24 20:06 horta

I just wondered if by convention, one or the other is the ref.

On Wed, 19 Jun 2024 at 21:33, Danilo Horta @.***> wrote:

It would be great if there is a way to avoid the above confusion. Let me know if there is!

— Reply to this email directly, view it on GitHub https://github.com/limix/pandas-plink/issues/28#issuecomment-2179424632, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA7NEZ6FSTSEMORWGYELOTZIHTJRAVCNFSM6AAAAABJRRPJL6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZZGQZDINRTGI . You are receiving this because you commented.Message ID: @.***>

dbolser avatar Jun 19 '24 21:06 dbolser