hap.py icon indicating copy to clipboard operation
hap.py copied to clipboard

Using --usefiltered-truth results in incorrect FP calls for filtered variants in the truth (xcmp)

Open melferink opened this issue 1 year ago • 0 comments

Hi, thanks for the great tool!

I’ve stumbled on a bug that causes FP variant calls when performing a pairwise comparison of VCFs (xcmp mode) while when taking all variants into account (that is, including non-PASS). We've tested Hap.py v0.3.14 (and also present in 0.3.15 looking at the code)

Please correct me if I understood it incorrectly ;) Argument can be --usefiltered-truth used to take filtered variants (non-Pass and non-‘dot’) into account. While this option works It appears that these variants will always result in a FP as the struth will always have the genotype ./.

Example:

Truth VCF (only 1 variant shown): 15 90294306 rs6496598 C G 21269.49 SnpCluster AC=2;AF=1.00;AN=2 GT:AD:DP:GQ:PL 1/1:0,59:59:99:2696,207,0 Query (only 1 variant shown): 15 90294306 rs6496598 C G 21269.49 SnpCluster AC=2;AF=1.00;AN=2 GT:AD:DP:GQ:PL 1/1:0,28:28:99:1440,108,0

Command: hap.py truth.vcf.gz query.vcf.gz --usefiltered-truth --reference <ref_fasta> --threads 2 -R <bed-file> -T <bed_file> -o output

I’ve tested different varieties of options (including --preprocess-truth) and the result was always the same:

15 90294306 . C G 21269.5 SnpCluster BS=90294306 GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 1/1:FP:.:tv:SNP:homalt:21269.5

Although the position itself was in the output (this is not the case when --usefiltered-truth was not used), the truth genotype is ./. or NOCALL, and thus a FP in this specific case.

Diving in the code I’ve noticed that the bug seems to be in the xcmp command as option “--apply-filters-truth False” should have been passed to this command when “--usefiltered-truth” is used.

Example: Using some intermediate chunk of the original run I’ve test xcmp: again with two VCF files (filter.vcf.gz and pass.vcf.gz) and focusing on a single variant (15: 90294306). For filter.vcf.gz the filter status for that variant is SnpCluster, for pass.vcf.gz this is PASS

filter.vcf.gz 15 90294306 . C G 21269.5 SnpCluster AF=1 GT:AD:ADO:DP:GQ:PL 1/1:0,59:0:59:99:0

pass.vcf.gz 15 90294306 . C G 21269.5 PASS AF=1 GT:AD:ADO:DP:GQ:PL 1/1:0,28:0:28:99:0

Command: xcmp filter.vcf.gz pass.vcf.gz -o test_merge_xcmp.bcf -r Homo_sapiens.GRCh37.GATK.illumina.fasta Output: 15 90294306 . C G 21269.5 . AF=1;BS=90294306;IQQ=21269.5;ctype=simple:mismatch;gtt2=gt_homalt;kind=missing;type=FP GT ./. 1/1

Using the --apply-filters-truth False filter results in the expected output. Command: xcmp filter.vcf.gz pass.vcf.gz -o test_merge_xcmp.bcf -r Homo_sapiens.GRCh37.GATK.illumina.fasta --apply-filters-truth False

Output: 15 90294306 . C G 21269.5 SnpCluster AF=1;BS=90294306;IQQ=21269.5;ctype=simple:match;gtt1=gt_homalt;gtt2=gt_homalt;kind=match;type=TP GT 1/1 1/1

Btw. (sanity check for myself): using pass.vcf.gz as truth always results in a correct output, regardless is -apply-filters-truth is used

For me it makes sense to include “--apply-filters-truth False“ in the xcmp command if --usefiltered-truth is used in the main script. If I look into the code, this does not seems to be the case/ an option. Could you add this to the code, or am I missing something here?

melferink avatar Jul 24 '24 07:07 melferink