hap.py
                                
                                 hap.py copied to clipboard
                                
                                    hap.py copied to clipboard
                            
                            
                            
                        Using --usefiltered-truth results in incorrect FP calls for filtered variants in the truth (xcmp)
Hi, thanks for the great tool!
I’ve stumbled on a bug that causes FP variant calls when performing a pairwise comparison of VCFs (xcmp mode) while when taking all variants into account (that is, including non-PASS). We've tested Hap.py v0.3.14 (and also present in 0.3.15 looking at the code)
Please correct me if I understood it incorrectly ;) Argument can be --usefiltered-truth used to take filtered variants (non-Pass and non-‘dot’) into account. While this option works It appears that these variants will always result in a FP as the struth will always have the genotype ./.
Example:
Truth VCF (only 1 variant shown):
15	90294306	rs6496598	C	G	21269.49	SnpCluster	AC=2;AF=1.00;AN=2	GT:AD:DP:GQ:PL	1/1:0,59:59:99:2696,207,0
Query (only 1 variant shown):
15	90294306	rs6496598	C	G	21269.49	SnpCluster	AC=2;AF=1.00;AN=2	GT:AD:DP:GQ:PL	1/1:0,28:28:99:1440,108,0
Command:
hap.py  truth.vcf.gz query.vcf.gz --usefiltered-truth --reference <ref_fasta> --threads 2 -R <bed-file> -T <bed_file> -o output
I’ve tested different varieties of options (including --preprocess-truth) and the result was always the same:
15	90294306	.	C	G	21269.5	SnpCluster	BS=90294306	GT:BD:BK:BI:BVT:BLT:QQ	./.:.:.:.:NOCALL:nocall:.	1/1:FP:.:tv:SNP:homalt:21269.5
Although the position itself was in the output (this is not the case when --usefiltered-truth was not used), the truth genotype is ./. or NOCALL, and thus a FP in this specific case.
Diving in the code I’ve noticed that the bug seems to be in the xcmp command as option “--apply-filters-truth False” should have been passed to this command when “--usefiltered-truth” is used.
Example: Using some intermediate chunk of the original run I’ve test xcmp: again with two VCF files (filter.vcf.gz and pass.vcf.gz) and focusing on a single variant (15: 90294306). For filter.vcf.gz the filter status for that variant is SnpCluster, for pass.vcf.gz this is PASS
filter.vcf.gz
15	90294306	.	C	G	21269.5	SnpCluster	AF=1	GT:AD:ADO:DP:GQ:PL	1/1:0,59:0:59:99:0 
pass.vcf.gz
15	90294306	.	C	G	21269.5	PASS	AF=1	GT:AD:ADO:DP:GQ:PL	1/1:0,28:0:28:99:0 
Command:
xcmp filter.vcf.gz pass.vcf.gz -o test_merge_xcmp.bcf -r Homo_sapiens.GRCh37.GATK.illumina.fasta 
Output:
15	90294306	.	C	G	21269.5	.	AF=1;BS=90294306;IQQ=21269.5;ctype=simple:mismatch;gtt2=gt_homalt;kind=missing;type=FP	GT	./.	1/1 
Using the --apply-filters-truth False filter results in the expected output.
Command:
xcmp filter.vcf.gz pass.vcf.gz -o test_merge_xcmp.bcf -r Homo_sapiens.GRCh37.GATK.illumina.fasta --apply-filters-truth False
Output:
15	90294306	.	C	G	21269.5	SnpCluster	AF=1;BS=90294306;IQQ=21269.5;ctype=simple:match;gtt1=gt_homalt;gtt2=gt_homalt;kind=match;type=TP	GT	1/1	1/1 
Btw. (sanity check for myself): using pass.vcf.gz as truth always results in a correct output, regardless is -apply-filters-truth is used
For me it makes sense to include “--apply-filters-truth False“ in the xcmp command if --usefiltered-truth is used in the main script. If I look into the code, this does not seems to be the case/ an option. Could you add this to the code, or am I missing something here?