hap.py icon indicating copy to clipboard operation
hap.py copied to clipboard

FP in phased variant overlapping non-phased variant

Open arangrhie opened this issue 4 years ago • 3 comments

Hello,

It seems like hap.py does not properly handle phased variants that overlap with non-phased variants.

Here is an example:

chr20	55234351	.	A	T	500	PASS	AC=1;AF=0.5;AN=2;DA=39;DP=78;GM=1;MC=0;MF=0;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;SC=TGTATGTGTAAATATATATAT;set=variant2-variant3	GT:GQ:PS:UG:UQ	1|0:99:55134343:0/1:147.24
chr20	55234351	rs139242639	AAT	A,TAT	405.06	PASS	AC=1,1;AF=0.5,0.5;AN=2;DB;DP=17;ExcessHet=3.0103;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;QD=23.83;SOR=1.371;VQSLOD=1.56;culprit=DP;set=variant	GT:AD:DP:GQ:PL	1/2:0,11,6:17:99:422,157,163,307,0,444

and here is what hap.py produced for the same chr20:55234351:

chr20	55234351	.	AAT	TAT,A	50	.	BS=55234351;Regions=CONF,TS_contained	GT:BD:BK:BI:BVT:BLT:QQ	2/1:FN:lm:d1_5,tv:INDEL:hetalt:.	./.:.:.:.:NOCALL:nocall:0
chr20	55234351	.	A	T	500	.	BS=55234351;Regions=CONF,TS_contained	GT:BD:BK:BI:BVT:BLT:QQ	./.:.:.:.:NOCALL:nocall:.	1/1:FP:lm:tv:SNP:homalt:500
chr20	55234351	.	AAT	A	405.06	.	BS=55234351;Regions=CONF,TS_contained	GT:BD:BK:BI:BVT:BLT:QQ	./.:.:.:.:NOCALL:nocall:.	0/1:FP:lm:d1_5:INDEL:het:405.06

It looks like the 1|0 turned into 1/1 while evaluating, and puts both the variants as FPs.

On the other hand, when a filtered set was provided:

chr20	55234351	rs139242639	AAT	A,TAT	405.1	PASS	AC=1,1;AF=0.5,0.5;AN=2;DB;DP=17;ExcessHet=3.0103;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;QD=23.83;SOR=1.371;VQSLOD=1.56;culprit=DP;set=variant	GT:AD:DP:GQ:PL	1/2:0,11,6:17:99:422,157,163,307,0,444

Now the non-phased variant becomes TP:

chr20	55234351	.	AAT	TAT,A	50	.	BS=55234351;Regions=CONF,TS_contained	GT:BD:BK:BI:BVT:BLT:QQ	2/1:TP:gm:d1_5,tv:INDEL:hetalt:405.1	./.:.:.:.:NOCALL:nocall:0
chr20	55234351	.	A	T	405.1	.	BS=55234351;Regions=CONF,TS_contained	GT:BD:BK:BI:BVT:BLT:QQ	./.:.:.:.:NOCALL:nocall:.	0/1:TP:gm:tv:SNP:het:405.1
chr20	55234351	.	AAT	A	405.1	.	BS=55234351;Regions=CONF,TS_contained	GT:BD:BK:BI:BVT:BLT:QQ	./.:.:.:.:NOCALL:nocall:.	0/1:TP:gm:d1_5:INDEL:het:405.1

Is this an expected behavior in hap.py?

I am using

commit 9d128a99e85bacd47f1f5da0e774e92e88b5745a
Merge: 46ee23c ec8172c
Author: Dylan Skola <[email protected]>
Date:   Thu Feb 13 15:21:32 2020 -0800

    Merge pull request #110 from Illumina/issue_109
    
    Added missing bcftools invocation to command line for region processing

and benchmarked against

HG002_GRCh38_1_22_v4.2.1_benchmark.bcf
HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed

Thanks, Arang

arangrhie avatar Jul 07 '21 18:07 arangrhie

Hi Arang,

Good question! I think hap.py is behaving as expected here, but it's a little complicated. It appears your first vcf was probably merged from multiple callers, so I think it is essentially calling the same heterozygous SNV twice. hap.py tries to find a way to combine these calls as separate variant calls, and the only way to do that is to say the "heterozygous" SNVs are on opposite haplotypes, which essentially makes a homozygous SNV, and it is counted as a FP because it doesn't match the correct heterozygous genotype for the SNV. When you remove this extra SNV from the vcf, then everything matches the benchmark vcf even though it is represented as a single line instead of 2 lines.

Does that make sense? Thanks! Justin

jzook avatar Jul 08 '21 14:07 jzook

It's true there is redundancy in the unfiltered example, but the het variant marked as phased is essentially contained in the second (also het) unphased one, so why would hap.py assume it lies on both alleles? If one were to assume it was the same variant, it would make room for the two base deletion contained in the second variant, and everything would be consistent.

nhansen avatar Jul 08 '21 16:07 nhansen

Does this conversion to homozygous snp happen during preprocessing stages, or is it part of the comparison engine? What happens if you use the vcfeval engine? When I ran this example directly with rtg vcfeval (i.e. outside of hap.py), the ga4gh intermediate output file contains:

chr20   55234351        .       A       T       .       PASS    BS=55234351     GT:QQ:BD:BK     .       1|0:50.0:FP:lm
chr20   55234351        .       AAT     A,TAT   .       PASS    BS=55234351     GT:BD:BK:QQ     1/2:TP:gm       1/2:TP:gm:50.0

Which is exactly what I expect it should do -- the spurious het SNP is classified as FP.

Lenbok avatar Jul 09 '21 05:07 Lenbok