
plp file

Open jjenny opened this issue 2 years ago • 2 comments

Hi,

Can you elaborate on what is in the .plp file created by dsc-pileup? Specifically, how should the 'Alleles' and 'Baseqs' columns be interpreted?

Thanks.

jjenny avatar Apr 01 '22 13:04 jjenny

We were also wondering the same thing. We are trying to use freemuxlet to call the genotypes of single cells. We were wondering about the ALLELES column in the .plp.gz file. Can you explain to us what the values in this column mean exactly? We saw that this is some sequence of 0/1/2 digits, but we are not sure we understand what they mean. The information that we would like to extract is for each combination of SNP+cell, how many reads there are that support either of the two alleles (the reference vs. alternative allele), like you'd normally find in a pVCF file. Is this information available somewhere?

nadavbra avatar Jun 16 '22 22:06 nadavbra

  • In the ALLELES column, 0/1/2 represents alleles specified in the input VCF file  

    • 0 - REF allele 
    • 1 - ALT allele 
    • 2 - other alleles (non-REF, non-ALT)
  • The next column contains base qualities encoded in Phred-scale. This is a typical convention in SAM format.    

    • To convert a character into a Phred-scale quality, take the ASCII code of the character and subtract 33.
    • To convert a Phred-scale quality into a base-call error rate, use pow(0.1, x/10), where x is the Phred-scale quality. You may want to put a lower bound on the error rate by capping x at 40 or 50.
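The two conversions above can be sketched in Python as follows. This is a minimal illustration, not code from popscle: the function names and the `cap` parameter are hypothetical, and the default cap of 40 is just one of the two values suggested above.

```python
def phred_char_to_quality(ch: str) -> int:
    """Convert one BASEQS character to a Phred-scale quality (ASCII offset 33)."""
    return ord(ch) - 33


def quality_to_error_rate(quality: int, cap: int = 40) -> float:
    """Convert a Phred-scale quality to a base-call error rate.

    The quality is capped (assumed default: 40) so that the error rate
    never drops below 0.1 ** (cap / 10).
    """
    capped = min(quality, cap)
    return 0.1 ** (capped / 10)


# 'F' is ASCII 70, so it encodes quality 37; ':' is ASCII 58, so quality 25.
for ch in "F:":
    q = phred_char_to_quality(ch)
    print(ch, q, quality_to_error_rate(q))
```

With these definitions, the `F` characters seen throughout the examples in this thread correspond to quality 37, and `:` corresponds to quality 25.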

hyunminkang avatar Jul 03 '22 16:07 hyunminkang

@hyunminkang thanks for the explanation. How should we interpret long strings of 0/1 in the ALLELES column and F in the BASEQS column? Is this normal output? For example:

11 53 1 F
36 53 00 FF
41 53 0 :
42 53 1 F
90 53 0 F
91 53 0 F
92 53 010 FFF
106 53 1100 FFFF
138 53 0 F
153 53 0 F
172 53 0 F
187 53 11 FF
190 53 0 F
192 53 0 F
194 53 0 F
208 53 0 F
267 53 010 FFF
300 53 010 FFF

Thanks!

josemovi avatar Apr 20 '23 15:04 josemovi

As I mentioned above it contains base qualities encoded in Phred-scale (ASCII with offset 33). This is a typical convention in SAM format. Please see the SAM specifications for details.

hyunminkang avatar Apr 20 '23 16:04 hyunminkang

Thanks for your quick answer. I understand that the BASEQS column contains base qualities (in my example, it contains Fs). What I find hard to understand is the meaning of the long strings of zeroes and ones in the ALLELES column. I would expect to find a single 0, 1, or 2 in each ALLELES field. What does it mean to have 1100 in an ALLELES field? I'm trying to quantify unique SNPs per sample. Many thanks



josemovi avatar Apr 20 '23 22:04 josemovi

I believe that it is explained in the thread above too?

In the ALLELES column, 0/1/2 represents alleles specified in the input VCF file:

  • 0 - REF allele
  • 1 - ALT allele
  • 2 - other alleles (non-REF, non-ALT)

The order of characters in ALLELES and BASEQS matches: the i-th character in each column refers to the same read. The order itself is not particularly important, but it is consistent with the order of reads.
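Because the two strings line up read by read, the per-allele read counts asked about earlier in the thread can be derived by walking both strings together. The sketch below assumes that pairing; the function name and the `min_quality` filter are hypothetical and not part of popscle.

```python
from collections import Counter


def count_alleles(alleles: str, baseqs: str, min_quality: int = 0) -> dict:
    """Count REF/ALT/other reads for one droplet-SNP pair.

    Each character in `alleles` is paired with the character at the same
    position in `baseqs`. Reads whose Phred quality (ASCII code minus 33)
    is below `min_quality` are skipped.
    """
    if len(alleles) != len(baseqs):
        raise ValueError("ALLELES and BASEQS must have the same length")
    counts = Counter()
    for allele, qual_char in zip(alleles, baseqs):
        if ord(qual_char) - 33 >= min_quality:
            counts[allele] += 1
    return {"ref": counts["0"], "alt": counts["1"], "other": counts["2"]}


# Four reads covering the SNP in one droplet: two REF and two ALT.
print(count_alleles("1100", "FFFF"))
# Requiring quality >= 30 drops the read encoded as ':' (quality 25).
print(count_alleles("011", ":FF", min_quality=30))
```

Each character is therefore one read, so a field such as 1100 describes four reads from the same droplet rather than a single genotype call.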

hyunminkang avatar Apr 20 '23 22:04 hyunminkang

My apologies if I'm not being clear in explaining my issue.

This is the output from one of the pileup files (ALLELES is the third column):

#DROPLET_ID SNP_ID ALLELES BASEQS
260 57 011 :FF
261 57 111111 F:FFFF
263 57 01 FF
265 57 1 F
266 57 0 F
267 57 101110111011101111011010001010001000101110001111101 FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
270 57 0 F
271 57 01 FF
272 57 11000 FFFFF
273 57 111001 FFFFFF
276 57 1 F

These long strings of 0&1 in the ALLELES columns are puzzling me as in the variant file *pileup.var.gz the SNP 57 reads:

#SNP_ID CHROM POS REF ALT AF
57 chr1 1013541 T C 0.95190

And the same SNP in the VCF file that I used initially reads:

chr1 1013541 1:948921 T C . PASS AC=58;AF=0.9519;AN=60;MAF=0.0481;R2=0.73447

Could it be that a string of 0s and 1s (instead of a single 1 or 0) in the pileup ALLELES column means that different reads from the same cell carry different alleles? Or is something else happening? Thanks

josemovi avatar Apr 21 '23 09:04 josemovi

It looks like the variant is heterozygous in that particular cell. I am not sure what the problem is here.
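One rough way to check that reading is to tabulate REF and ALT reads per droplet and look for droplets with both. The sketch below assumes the whitespace-delimited four-column layout shown in the excerpts above; `tabulate_plp` is a hypothetical helper, not a popscle function, and the rows are a subset copied from the example.

```python
from typing import Dict, Iterable, Tuple


def tabulate_plp(lines: Iterable[str]) -> Dict[Tuple[int, int], Tuple[int, int, int]]:
    """Map (droplet_id, snp_id) to (ref_reads, alt_reads, other_reads)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header row
        droplet_id, snp_id, alleles, _baseqs = line.split()
        table[(int(droplet_id), int(snp_id))] = (
            alleles.count("0"),
            alleles.count("1"),
            alleles.count("2"),
        )
    return table


rows = [
    "#DROPLET_ID SNP_ID ALLELES BASEQS",
    "260 57 011 :FF",
    "265 57 1 F",
    "272 57 11000 FFFFF",
]
for (droplet, snp), (ref, alt, other) in tabulate_plp(rows).items():
    mixed = ref > 0 and alt > 0  # reads supporting both alleles in one droplet
    print(droplet, snp, ref, alt, other, "mixed" if mixed else "single allele")
```

For a real .plp.gz file, the same function could be fed the lines of a handle opened with `gzip.open(path, "rt")`, where `path` stands for your own pileup file.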

hyunminkang avatar Apr 21 '23 10:04 hyunminkang