hap.py
hap.py copied to clipboard
FP in phased variant overlapping non-phased variant
Hello,
It seems like hap.py does not properly handle phased variants that overlap with non-phased variants.
Here is an example:
chr20 55234351 . A T 500 PASS AC=1;AF=0.5;AN=2;DA=39;DP=78;GM=1;MC=0;MF=0;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;SC=TGTATGTGTAAATATATATAT;set=variant2-variant3 GT:GQ:PS:UG:UQ 1|0:99:55134343:0/1:147.24
chr20 55234351 rs139242639 AAT A,TAT 405.06 PASS AC=1,1;AF=0.5,0.5;AN=2;DB;DP=17;ExcessHet=3.0103;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;QD=23.83;SOR=1.371;VQSLOD=1.56;culprit=DP;set=variant GT:AD:DP:GQ:PL 1/2:0,11,6:17:99:422,157,163,307,0,444
and here is what hap.py produced for the same chr20:55234351:
chr20 55234351 . AAT TAT,A 50 . BS=55234351;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ 2/1:FN:lm:d1_5,tv:INDEL:hetalt:. ./.:.:.:.:NOCALL:nocall:0
chr20 55234351 . A T 500 . BS=55234351;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 1/1:FP:lm:tv:SNP:homalt:500
chr20 55234351 . AAT A 405.06 . BS=55234351;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:d1_5:INDEL:het:405.06
It looks like the 1|0 turned into 1/1 while evaluating, and puts both the variants as FPs.
On the other hand, when a filtered set was provided:
chr20 55234351 rs139242639 AAT A,TAT 405.1 PASS AC=1,1;AF=0.5,0.5;AN=2;DB;DP=17;ExcessHet=3.0103;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;QD=23.83;SOR=1.371;VQSLOD=1.56;culprit=DP;set=variant GT:AD:DP:GQ:PL 1/2:0,11,6:17:99:422,157,163,307,0,444
Now the non-phased variant becomes TP:
chr20 55234351 . AAT TAT,A 50 . BS=55234351;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ 2/1:TP:gm:d1_5,tv:INDEL:hetalt:405.1 ./.:.:.:.:NOCALL:nocall:0
chr20 55234351 . A T 405.1 . BS=55234351;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:TP:gm:tv:SNP:het:405.1
chr20 55234351 . AAT A 405.1 . BS=55234351;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:TP:gm:d1_5:INDEL:het:405.1
Is this an expected behavior in hap.py?
I am using
commit 9d128a99e85bacd47f1f5da0e774e92e88b5745a
Merge: 46ee23c ec8172c
Author: Dylan Skola <[email protected]>
Date: Thu Feb 13 15:21:32 2020 -0800
Merge pull request #110 from Illumina/issue_109
Added missing bcftools invocation to command line for region processing
and benchmarked against
HG002_GRCh38_1_22_v4.2.1_benchmark.bcf
HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
Thanks, Arang
Hi Arang,
Good question! I think hap.py is behaving as expected here, but it's a little complicated. It appears your first vcf was probably merged from multiple callers, so I think it is essentially calling the same heterozygous SNV twice. hap.py tries to find a way to combine these calls as separate variant calls, and the only way to do that is to say the "heterozygous" SNVs are on opposite haplotypes, which essentially makes a homozygous SNV, and it is counted as a FP because it doesn't match the correct heterozygous genotype for the SNV. When you remove this extra SNV from the vcf, then everything matches the benchmark vcf even though it is represented as a single line instead of 2 lines.
Does that make sense? Thanks! Justin
It's true there is redundancy in the unfiltered example, but the het variant marked as phased is essentially contained in the second (also het) unphased one, so why would hap.py assume it lies on both alleles? If one were to assume it was the same variant, it would make room for the two base deletion contained in the second variant, and everything would be consistent.
Does this conversion to homozygous snp happen during preprocessing stages, or is it part of the comparison engine? What happens if you use the vcfeval engine? When I ran this example directly with rtg vcfeval (i.e. outside of hap.py), the ga4gh intermediate output file contains:
chr20 55234351 . A T . PASS BS=55234351 GT:QQ:BD:BK . 1|0:50.0:FP:lm
chr20 55234351 . AAT A,TAT . PASS BS=55234351 GT:BD:BK:QQ 1/2:TP:gm 1/2:TP:gm:50.0
Which is exactly what I expect it should do -- the spurious het SNP is classified as FP.