cyvcf2 icon indicating copy to clipboard operation
cyvcf2 copied to clipboard

gt_depth unequal to DP from FORMAT Column in vcf

Open roselucia opened this issue 5 years ago • 6 comments

Hi,

I have a question regarding gt_depths. On the website website "https://gemini.readthedocs.io/en/latest/content/database_schema.html#the-variant-impacts-table" it says, that gt_depths is "A compressed binary vector of the depth of aligned sequence observed for each sample Extracted from the VCF DP genotype tag". However, comparing gt_depth to the DP value of the vcf in the Format column it seems, as if gt_depth is been extracted from the AD value (ADref+ADalt). And in the relevant source code indeed depth is described as the sum of gt_alt_depth and gt_ref_depth https://github.com/brentp/cyvcf2/blob/06f19fc84548aa8303a14dc70fb9693e25857cd4/cyvcf2/cyvcf2.pyx#L1669

However, to my understanding AD Alt + AD Ref is not always equal to DP. The Illumina Tech Support explained it to me like this: sum AD Ref + AD Alt < DP „I've been looking into this, there are examples of DP being greater than AD for the alt allele. Allele depth will only account for reads assigned to one of the alleles which have been considered for the genotype. Some reads may have an allele which is not considered, which would get counted in DP but not in the sum of AD values. For example if ref is A, alt is C, there may be some T's and G's which may be present as errors." sum AD Ref + AD Alt > DP „While the sample-level (FORMAT) DP field describes the total depth of reads that passed the Unified Genotyper's internal quality control metrics (like MAPQ > 17, for example), the AD values (one for each of REF and ALT fields) is the unfiltered count of all reads that carried with them the REF and ALT alleles. The reason for this distinction is that the DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is a an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL. Please note, however, that the AD isn't necessarily calculated exactly for indels. Only reads which are statistically favoring one allele over the other are counted. Because of this fact, the sum of AD may be different than the individual sample depth, especially when there are many non-informatice reads. Because the AD includes reads and bases that were filtered by the Unified Genotyper and in case of indels is based on a statistical computation, one should not base assumptions about the underlying genotype based on it; instead, the genotype likelihoods (PLs) are what determine the genotype calls.“

Is there a chance to obtain the depth derived from the DP (Sample FORMAT)?

Thanks a lot! All the best, Rose

roselucia avatar Nov 05 '19 10:11 roselucia

hi, you are correct that cyvcf2 uses AD to get the gt_depths. you can get the DP using:

dps = variant.format("DP")

brentp avatar Nov 05 '19 13:11 brentp

Hi @brentp Thanks for your response. Do I understand you correct, that I can use "dps_samplename" in my Gemini query to obtain the DP of the Format column? Could I also get the variant frequency of the format column? If yes what would I use for the Gemini query? Thanks a lot for the Help! All the best, Rose

roselucia avatar Nov 06 '19 08:11 roselucia

hi, you can not get the DP information with gemini, you'd have to use cyvcf2 on the VCF.

for the variant frequency, you can use gt_alt_freqs in gemini.

brentp avatar Nov 06 '19 13:11 brentp

Hi, thanks for the response. I will stick to gt_depths and gt_alt_freqs then and just keep in mind that this differers form FORMAT DP und FORMAT VF of the original vcf. Thanks a lot! Rose

roselucia avatar Nov 08 '19 11:11 roselucia

Hi! Many, many thanks for a great library!

On a similar note as for gt_depths, gt_alt_freqs seems defined gt_alt_depths / (gt_alt_depths + gt_ref_depths). It turned out a pitfall for us as we expected it to be GT.AF, but I can see how you might want either of these. When you have time, it would also not be bad with the API doc briefly stating this, or maybe just a link over to the Gemini docs.

Could an additional function or option to gt_alt_freqs that does return GT.AF, INFO.AF or gt_alt_depths / (gt_alt_depths + gt_ref_depths) as currently counted from ADs if needed. It should be fine working around it, but it would be faster in the API.

dnil avatar Oct 15 '21 07:10 dnil

Hi, a pull request with the changes you would like would be the best way to get this in.

brentp avatar Oct 15 '21 08:10 brentp