bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

Fix GT splitting in bcftools norm

Open jpiper opened this issue 8 years ago • 6 comments

Hi All,

Second time lucky (first pull request was polluted with stuff). It appears to me there's a pretty serious flaw with GT flag splitting in the normalisation. This is what I'm seeing:

Splitting multiallelic sites such as

chr1    113938929   .   T   A,C 52.00   LowGQX;HighDPFRatio SNVSB=0.0;SNVHPOL=21    GT:GQ:GQX:DP:DPF:AD 1/2:7:5:5:8:0,4,1

with a GT of 1/2 with bcftools norm -m- <in> incorrectly yields two records with a GT 1/0 and 0/1, which should really be be 1/. and ./1

chr1    113938929   .   T   A   52  LowGQX;HighDPFRatio SNVSB=0;SNVHPOL=21  GT:GQ:GQX:DP:DPF:AD 1/0:7:5:5:8:0,4
chr1    113938929   .   T   C   52  LowGQX;HighDPFRatio SNVSB=0;SNVHPOL=21  GT:GQ:GQX:DP:DPF:AD 0/1:7:5:5:8:0,1

In the integration tests, I also notice sites with a 2 getting split to a 0 and a 1 site...

20  275 .   A   C,G 999 PASS    INDEL;AN=2;AC=0,2   GT:PL:DP:FGF:FGI:FGS:FSTR   2:0,0,0:0:1e+06,2e+06,3e+06:1111,2222,3333:A,BB,CCC:WORD    2:0,0,0:0:1e+06,2e+06,3e+06:1111,2222,3333:A,BB,CCC:WORD

to

20  275 .   A   C   999 PASS    INDEL;AN=2;AC=0 GT:PL:DP:FGF:FGI:FGS:FSTR   0:0,0:0:1e+06,2e+06:1111,2222:A,BB:WORD 0:0,0:0:1e+06,2e+06:1111,2222:A,BB:WORD
20  275 .   A   G   999 PASS    INDEL;AN=2;AC=2 GT:PL:DP:FGF:FGI:FGS:FSTR   1:0,0:0:1e+06,3e+06:1111,3333:A,CCC:WORD    1:0,0:0:1e+06,3e+06:1111,3333:A,CCC:WORD

which should surely be . and 1 (For humans, you wouldn't expect a haploid call in this specific region - the spec says we would only expect this in M or Y for humans)

I have patched here to replace

gt[j] = bcf_gt_unphased(0) | bcf_gt_is_phased(gt[j]); // set to REF

with

gt[j] = bcf_gt_missing | bcf_gt_is_phased(gt[j]); // set to REF

and altered the expected output from the unit tests accordingly, in case you find this is actually a bug rather than intended behaviour.

jpiper avatar Oct 28 '15 10:10 jpiper

Technically shouldn't it be the symbolic allele <*>, the dot allele being missing data?

mp15 avatar Oct 28 '15 11:10 mp15

Not <*>, but maybe *.

pd3 avatar Oct 28 '15 11:10 pd3

Ideally, yes. Unfortunately the VCF specification only allows non-negative integers or . in this field though, so this was the best I could do without breaking spec

jpiper avatar Oct 28 '15 11:10 jpiper

I think @mp15 meant to use * ALT allele which is a valid way how to specify missing alleles due to conflicting variants. A valid VCF notation then would be:

REF=A  ALT=C,*  GT=0/2

pd3 avatar Oct 28 '15 12:10 pd3

Hello @mp15 @pd3 @jpiper

I am using bcftools norm at the moment and I think I have observed a problematic case during my use of the behaviour of bcftools norm for GT splitting, just as @jpiper described.

I am splitting multi allelic haploid sites to filter on the ratio of alternate allele observation counts over read depth. However when I try merging bi allelic sites into multi allelic sites after the filtering, the conflicting GT are badly resolved for everything but the first allele I think.

Consider this :

NC_000962.3	1849	.	C	A,T	3709.44	PASS	INFO_STUFF	GT:DP:AD:RO:QR:AO:QA:GL	0:247:247,0,0:247:9070:0,0:0,0:0,-816.14,-816.14	1:58:1,57,0:1:29:57,0:2142,0:-190.113,0,-192.721	1:38:18,20,0:0:0:38,0:1380,0:-124.453,0,-124.453	2:42:12,0,30:0:0:0,42:0,1578:-142.262,-142.262,0

(I modified by hand the AD value for my test cases)

If I filter to keep only calls supported by more than 0.75 of the observation with the following command :

bcftools norm -m -any input.vcf | bcftools filter -s "freq" -S "." -e "GT='a' & FORMAT/AD[1]/FORMAT/DP < 0.75" | bcftools norm -m +any

I get the following result:

NC_000962.3	1849	.	C	A,T	3709.44	freq	INFO_STUFF	GT:DP:AD:RO:QR:AO:QA:GL	0:247:247,0,0:247:9070:0,0:0,0:0,-816.14,-816.14	1:58:1,57,0:1:29:57,0:2142,0:-190.113,0,-192.721	.:38:18,20,0:0:0:38,0:1380,0:-124.453,0,-124.453	0:42:12,0,30:0:0:0,42:0,1578:-142.262,-142.262,0	

The third sample was correctly set to GT='.' whether the fourth one is now "0" where I would like him to be "."

By any chance do you think the proposed enhancement here is coming for bcftools ? Or by chance would you know an easier way of filtering each sample based with an ALT call based on the frequency of the most frequent allele observation like this ?

For now the only fix I have thought about is adding an extra filtering for the reference calls that seem to be "erroneous", such as this :

bcftools filter -s "feq" -S "." -e "GT='r' & FORMAT/AD[0]/FORMAT/DP < 1-0.75"

but I think this might be a bit dangerous. Like this, both entries for the fourth sample in the MWE I presented are set to "." and stay "." when merged.

Thanks a lot for your help and thanks a lot for the tool !

sachalau avatar Feb 26 '18 16:02 sachalau

I think its a fundamental problem with representing multi allelics as binary. Three alleles A,B,C Gives the following genotypes:

alleles A B C
A A/A A/B A/C
B B/B B/C
C C/C

If we try and convert to two bi-allelics using A as the reference we can represent the following genotypes

multiple biallelics R/R R/A A/A
alt=B A/A A/B B/B
alt=C A/A A/C C/C

We lose the non-reference mutual heterozygotes i.e. the B/C genotype class plus the AA genotype gets represented twice.

The * allele code is for long deletions causing a third allele. But what we if we have two distinct long deletions overlapping a SNP site?

Variant graphs will sort this out in the long term.

Short term: Filter them out and hope it's not a signal lost for analysis in GWAS.

mdkeehan avatar Aug 21 '19 22:08 mdkeehan