
duphold use with gVCF files

Open am8265 opened this issue 3 years ago • 4 comments

Hello @brentp,

Thanks for this awesome tool! It would be great if duphold could compute the B-allele frequency / DHGT field directly from gVCF files (where non-variant sites are represented as blocks to save space), rather than being limited to regular VCF files.

Thanks and let us know if you have plans to implement such a feature.

am8265 avatar Feb 06 '21 08:02 am8265

Hi @brentp, just following up on this. Do you think it will be possible to implement such a feature in the near future?

am8265 avatar Mar 24 '21 21:03 am8265

you mean to still send a CRAM/BAM, but to evaluate every site in a GVCF? I suppose that's possible, but duphold is for larger events and most gvcfs don't contain large SVs, do they?

brentp avatar Mar 25 '21 06:03 brentp

Hi @brentp, the CRAM/BAM file is used for coverage-based QC (DHFC, DHBFC, DHFFC), while the BCF file (multi-sample or single-sample) is used for SNP/indel annotation, i.e. to compute DHGT. So, in the context of SNP/indel annotation, I would suggest that duphold gain the ability to use a gVCF file, i.e. a file of the following format:

image

where 1:14605-14609 and 1:14611-14652 are represented as blocks of non-variant sites (which saves disk space), followed by a variant-only site at 1:14653.

1) Currently, to use this non-variant-site information when computing DHGT for an SV that overlaps it, I have to convert the gVCF files to regular VCF files, which takes up a lot of space.
2) Using only a variant-sites-only VCF file (single- or multi-sample) may not be a sufficient estimator of the quality of an SV deletion, i.e. rejecting an SV deletion if we find N het calls within it.
3) If we used the stretches of non-variant sites in a gVCF file to accurately count homozygous-reference sites, that would complete the SNP/indel-based filtering of SV deletions.

Of course, we can infer the number of non-variant sites overlapping a DEL SV of interest from a regular, variant-sites-only whole-genome VCF file, but our pipeline generates these gVCF files and it would help us to use duphold directly with them.

Please let me know what you think. As a first pass, may I suggest making duphold use the variant sites (i.e. hom-alt and het, with GTs such as 1/1, 0/1, 1/2) directly from a gVCF file? Later, if you have time, the non-variant blocks or false variant sites (with GT 0/0) could be used to compute hom-ref counts from the gVCF file.
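
To make the request concrete, here is a minimal Python sketch of the counting I have in mind. It is not duphold code and does not reflect how duphold computes DHGT; the function names, the example records (including their genotypes and depths), and the choice to count one hom-ref site per base of a reference block are all assumptions for illustration. It relies only on the gVCF convention that a reference block stores its last position in `INFO/END`.

```python
from typing import Dict, Iterable

def record_end(pos: int, ref: str, info: str) -> int:
    """1-based inclusive end of a record; gVCF reference blocks carry INFO/END."""
    for field in info.split(";"):
        if field.startswith("END="):
            return int(field[4:])
    return pos + len(ref) - 1

def classify(gt: str) -> str:
    """Map a GT string such as 0/0, 0/1, 1/2 or 1|1 to a genotype class."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "unknown"
    if all(a == "0" for a in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_alt"
    return "het"

def count_genotypes(lines: Iterable[str], chrom: str, start: int, end: int) -> Dict[str, int]:
    """Count genotype classes inside chrom:start-end (1-based, inclusive).

    Assumption: every base of a hom-ref block that overlaps the interval
    counts as one hom-ref site; every other record counts once.
    """
    counts = {"hom_ref": 0, "het": 0, "hom_alt": 0, "unknown": 0}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if f[0] != chrom:
            continue
        pos = int(f[1])
        overlap = min(record_end(pos, f[3], f[7]), end) - max(pos, start) + 1
        if overlap <= 0:
            continue
        # Single-sample gVCF: the sample column is f[9], keyed by FORMAT in f[8].
        gt = f[9].split(":")[f[8].split(":").index("GT")]
        kind = classify(gt)
        counts[kind] += overlap if kind == "hom_ref" else 1
    return counts

# Hypothetical records mirroring the coordinates mentioned above.
EXAMPLE = [
    "1\t14605\t.\tT\t<NON_REF>\t.\t.\tEND=14609\tGT:DP\t0/0:30",
    "1\t14611\t.\tG\t<NON_REF>\t.\t.\tEND=14652\tGT:DP\t0/0:28",
    "1\t14653\t.\tC\tT,<NON_REF>\t50\t.\t.\tGT:DP\t0/1:25",
]

if __name__ == "__main__":
    print(count_genotypes(EXAMPLE, "1", 14600, 14700))
```

Whether a block should contribute one count per base or one count per block is a design choice; per-base counting is used here only because it matches what expanding the gVCF to a regular VCF would give.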

am8265 avatar Apr 02 '21 20:04 am8265

Since DHGT is not the primary feature of duphold (primary is depth annotation of SVs), I'm less inclined to work on this. It's also not clear to me what genotype information would be in non-variant blocks. If there's a variant, it will appear in a final, jointly-called VCF. I guess with the GVCF, you can have a block where there is no variant in any sample in the population (including the current sample of interest) and you get extra info that way because you know there is a block of hom-ref with decent coverage. Is that the benefit you see?

All that said, I would look into this if you have evidence that DHGT is valuable. I always found that depth was more reliable than DHGT.

brentp avatar Apr 07 '21 17:04 brentp