Deepvariant genotype

Open MKaandemir opened this issue 1 year ago • 6 comments

Hi,

Thanks for the great tool. I got the following variant lines. I wonder how I should handle them, since they are germline calls. The input BAM consists only of disease-causing tandem repeat regions. Would I get the same lines if I ran a whole-genome BAM? If so, how should I handle these cases?

chr8	118316369	.	CA	C,CAAAAAAAAAAA,CAAAAAAAAAAAA	34.6	PASS	.	GT:GQ:DP:AD:VAF:PL:PS	2|1:4:36:5,9,7,5:0.25,0.194444,0.138889:32,12,45,12,0,46,6,4,46,46:118261886
chr8	118620399	.	CAAAA	C,CA,CAAA	33.3	PASS	.	GT:GQ:DP:AD:VAF:PL:PS	3|1:4:25:2,8,3,6:0.32,0.12,0.24:30,8,43,3,43,43,8,0,10,41:118384801

MKaandemir avatar Jun 25 '24 07:06 MKaandemir

I looked at it again. The first line should be like this, right?

chr8 118316369 . CA C
chr8 118316370 . A AAAAAAAAAAA,AAAAAAAAAAAA

ghost avatar Jun 25 '24 08:06 ghost

Hi @MKaandemir,

chr8	118316369	.	CA	C,CAAAAAAAAAAA,CAAAAAAAAAAAA	34.6	PASS	.	GT:GQ:DP:AD:VAF:PL:PS	2|1:4:36:5,9,7,5:0.25,0.194444,0.138889:32,12,45,12,0,46,6,4,46,46:118261886

In this case, the genotype is 2|1. You can interpret this as:

allele0=ref (CA)
allele1=alt1 (CA->C)
allele2=alt2 (CA->CAAAAAAAAAAA)
allele3=alt3 (CA->CAAAAAAAAAAAA)

Since the genotype is 2|1, you can read it as alt2|alt1. So the first haplotype sees:

CA->CAAAAAAAAAAA

And second haplotype sees:

CA->C

So in haplotype-1 you have a 10 bp insertion of As, and in haplotype-2 you have a 1 bp deletion of an A. You can represent this in many ways. However, if you left-shift, it would become:

chr8 118316369 . CA C,CAAAAAAAAAAA

This is equivalent to what you had in the VCF. You can use bcftools norm or another tool if you are trying to normalize the variant call. Any representation that gives you the right underlying haplotypes is a valid way to represent it, unless you are looking for something specific.
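
The interpretation above can be sketched in a few lines of Python. This is only an illustration, not part of DeepVariant or bcftools: the helper names `resolve_haplotypes` and `describe` are made up here, and the code assumes a single-sample, phased diploid record with tab-separated columns. It is applied to the second record from the question because its alleles are short enough to check by eye.

```python
def resolve_haplotypes(vcf_line):
    """Return (ref, [allele on haplotype 1, allele on haplotype 2])."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    alleles = [ref] + alt.split(",")  # index 0 = REF, 1..n = ALT alleles
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    # A phased genotype uses "|"; each number is an index into `alleles`.
    indices = [int(i) for i in sample["GT"].split("|")]
    return ref, [alleles[i] for i in indices]


def describe(ref, allele):
    """Classify an allele by its length change relative to REF."""
    delta = len(allele) - len(ref)
    if delta > 0:
        return f"{delta} bp insertion"
    if delta < 0:
        return f"{-delta} bp deletion"
    return "same length as REF"


# Second record from the question, rebuilt with explicit tabs.
record = "\t".join([
    "chr8", "118620399", ".", "CAAAA", "C,CA,CAAA", "33.3", "PASS", ".",
    "GT:GQ:DP:AD:VAF:PL:PS",
    "3|1:4:25:2,8,3,6:0.32,0.12,0.24:30,8,43,3,43,43,8,0,10,41:118384801",
])

ref, haps = resolve_haplotypes(record)
for n, allele in enumerate(haps, start=1):
    print(f"haplotype {n}: {ref} -> {allele} ({describe(ref, allele)})")
# haplotype 1: CAAAA -> CAAA (1 bp deletion)
# haplotype 2: CAAAA -> C (4 bp deletion)
```

Passing the first record through the same function returns the alt2 and alt1 alleles discussed above, in that order.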

kishwarshafin avatar Jun 25 '24 18:06 kishwarshafin

Thanks for the explanation! I'm also curious about why there are four allelic depths. Is the first one the reference allele's depth? In biallelic SNPs, it doesn't show the reference allele. Also, why does the genotype use only two alleles while the ALT column lists three?

ghost avatar Jun 26 '24 07:06 ghost

Hi @MKaandemir from the header you can see the description of each field:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##FORMAT=<ID=MED_DP,Number=1,Type=Integer,Description="Median DP observed within the GVCF block rounded to the nearest integer.">

Yes, the first value for AD is for the reference allele.
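
The `Number=` codes in the header lines explain how many values each field carries. In the VCF specification, `R` means one value per allele including REF, `A` means one value per ALT allele, and `G` means one value per possible genotype. The sketch below checks those counts against the first record and looks up which genotype the PL of 0 belongs to. It assumes a diploid sample; the helper names are illustrative only.

```python
from math import comb


def expected_counts(n_alt, ploidy=2):
    """Number of values expected for Number=R, A and G fields."""
    n_alleles = n_alt + 1  # ALT alleles plus REF
    return {
        "R": n_alleles,
        "A": n_alt,
        # Unordered genotypes: multisets of size `ploidy` over the alleles.
        "G": comb(n_alleles + ploidy - 1, ploidy),
    }


def diploid_genotype_order(n_alleles):
    """Genotypes in VCF PL order: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ..."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


# Sample values from the chr8:118316369 record (three ALT alleles).
ad = [5, 9, 7, 5]
vaf = [0.25, 0.194444, 0.138889]
pl = [32, 12, 45, 12, 0, 46, 6, 4, 46, 46]

counts = expected_counts(n_alt=3)
print(counts)                      # {'R': 4, 'A': 3, 'G': 10}
print(len(ad), len(vaf), len(pl))  # 4 3 10

order = diploid_genotype_order(4)
best = order[pl.index(min(pl))]
print(best)                        # (1, 2)
```

The most likely unordered genotype, 1/2, uses the same two alleles as the phased call 2|1. The third ALT allele still has its own entries in AD, VAF and PL even though the called genotype does not use it.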

kishwarshafin avatar Jun 27 '24 17:06 kishwarshafin

In the example line, the depth is listed as 36. However, the allelic depths are 5, 9, 7, and 5. The sum of these allelic depths does not equal the value in the DP field.

chr8	118316369	.	CA	C,CAAAAAAAAAAA,CAAAAAAAAAAAA	34.6	PASS	.	GT:GQ:DP:AD:VAF:PL:PS	2|1:4:36:5,9,7,5:0.25,0.194444,0.138889:32,12,45,12,0,46,6,4,46,46:118261886

ghost avatar Jun 28 '24 10:06 ghost

@MKaandemir that means there were more alleles at this position with lower frequency that were dropped by the candidate generation scheme because they did not meet all the heuristics required for an allele to become a candidate. You can read the DeepVariant manuscript to understand the process fully.
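
The size of the gap can be checked with a little arithmetic on the record's own numbers. The sketch below is not DeepVariant code and the function name `depth_gap` is made up for illustration. It only shows how many reads in DP are not counted in any AD entry; that those reads supported dropped alleles is the explanation given above, not something the arithmetic proves. It also shows that, in this record, the reported VAF values match AD divided by DP.

```python
def depth_gap(dp, ad):
    """Reads counted in DP but not assigned to any listed allele in AD."""
    return dp - sum(ad)


# Values from the chr8:118316369 record.
dp = 36
ad = [5, 9, 7, 5]                 # REF, alt1, alt2, alt3
vaf = [0.25, 0.194444, 0.138889]  # one value per ALT allele

gap = depth_gap(dp, ad)
print(gap)  # 10

# Compare AD/DP for each ALT allele with the reported VAF.
computed = [round(depth / dp, 6) for depth in ad[1:]]
print(computed)  # [0.25, 0.194444, 0.138889]
```

The second record behaves the same way: its AD values sum to 19 against a DP of 25.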

kishwarshafin avatar Jun 28 '24 15:06 kishwarshafin

Closing this issue. Please feel free to reopen if you have further questions.

kishwarshafin avatar Jul 03 '24 19:07 kishwarshafin