bcftools
bcftools copied to clipboard
Fix GT splitting in bcftools norm
Hi All,
Second time lucky (first pull request was polluted with stuff). It appears to me there's a pretty serious flaw with GT flag splitting in the normalisation. This is what I'm seeing:
Splitting multiallelic sites such as
chr1 113938929 . T A,C 52.00 LowGQX;HighDPFRatio SNVSB=0.0;SNVHPOL=21 GT:GQ:GQX:DP:DPF:AD 1/2:7:5:5:8:0,4,1
with a GT of 1/2
with bcftools norm -m- <in>
incorrectly yields two records with a GT 1/0
and 0/1
, which should really be be 1/.
and ./1
chr1 113938929 . T A 52 LowGQX;HighDPFRatio SNVSB=0;SNVHPOL=21 GT:GQ:GQX:DP:DPF:AD 1/0:7:5:5:8:0,4
chr1 113938929 . T C 52 LowGQX;HighDPFRatio SNVSB=0;SNVHPOL=21 GT:GQ:GQX:DP:DPF:AD 0/1:7:5:5:8:0,1
In the integration tests, I also notice sites with a 2
getting split to a 0
and a 1
site...
20 275 . A C,G 999 PASS INDEL;AN=2;AC=0,2 GT:PL:DP:FGF:FGI:FGS:FSTR 2:0,0,0:0:1e+06,2e+06,3e+06:1111,2222,3333:A,BB,CCC:WORD 2:0,0,0:0:1e+06,2e+06,3e+06:1111,2222,3333:A,BB,CCC:WORD
to
20 275 . A C 999 PASS INDEL;AN=2;AC=0 GT:PL:DP:FGF:FGI:FGS:FSTR 0:0,0:0:1e+06,2e+06:1111,2222:A,BB:WORD 0:0,0:0:1e+06,2e+06:1111,2222:A,BB:WORD
20 275 . A G 999 PASS INDEL;AN=2;AC=2 GT:PL:DP:FGF:FGI:FGS:FSTR 1:0,0:0:1e+06,3e+06:1111,3333:A,CCC:WORD 1:0,0:0:1e+06,3e+06:1111,3333:A,CCC:WORD
which should surely be .
and 1
(For humans, you wouldn't expect a haploid call in this specific region - the spec says we would only expect this in M or Y for humans)
I have patched here to replace
gt[j] = bcf_gt_unphased(0) | bcf_gt_is_phased(gt[j]); // set to REF
with
gt[j] = bcf_gt_missing | bcf_gt_is_phased(gt[j]); // set to REF
and altered the expected output from the unit tests accordingly, in case you find this is actually a bug rather than intended behaviour.
Technically shouldn't it be the symbolic allele <*>, the dot allele being missing data?
Not <*>
, but maybe *
.
Ideally, yes. Unfortunately the VCF specification only allows non-negative integers or .
in this field though, so this was the best I could do without breaking spec
I think @mp15 meant to use *
ALT allele which is a valid way how to specify missing alleles due to conflicting variants. A valid VCF notation then would be:
REF=A ALT=C,* GT=0/2
Hello @mp15 @pd3 @jpiper
I am using bcftools norm at the moment and I think I have observed a problematic case during my use of the behaviour of bcftools norm for GT splitting, just as @jpiper described.
I am splitting multi allelic haploid sites to filter on the ratio of alternate allele observation counts over read depth. However when I try merging bi allelic sites into multi allelic sites after the filtering, the conflicting GT are badly resolved for everything but the first allele I think.
Consider this :
NC_000962.3 1849 . C A,T 3709.44 PASS INFO_STUFF GT:DP:AD:RO:QR:AO:QA:GL 0:247:247,0,0:247:9070:0,0:0,0:0,-816.14,-816.14 1:58:1,57,0:1:29:57,0:2142,0:-190.113,0,-192.721 1:38:18,20,0:0:0:38,0:1380,0:-124.453,0,-124.453 2:42:12,0,30:0:0:0,42:0,1578:-142.262,-142.262,0
(I modified by hand the AD value for my test cases)
If I filter to keep only calls supported by more than 0.75 of the observation with the following command :
bcftools norm -m -any input.vcf | bcftools filter -s "freq" -S "." -e "GT='a' & FORMAT/AD[1]/FORMAT/DP < 0.75" | bcftools norm -m +any
I get the following result:
NC_000962.3 1849 . C A,T 3709.44 freq INFO_STUFF GT:DP:AD:RO:QR:AO:QA:GL 0:247:247,0,0:247:9070:0,0:0,0:0,-816.14,-816.14 1:58:1,57,0:1:29:57,0:2142,0:-190.113,0,-192.721 .:38:18,20,0:0:0:38,0:1380,0:-124.453,0,-124.453 0:42:12,0,30:0:0:0,42:0,1578:-142.262,-142.262,0
The third sample was correctly set to GT='.' whether the fourth one is now "0" where I would like him to be "."
By any chance do you think the proposed enhancement here is coming for bcftools ? Or by chance would you know an easier way of filtering each sample based with an ALT call based on the frequency of the most frequent allele observation like this ?
For now the only fix I have thought about is adding an extra filtering for the reference calls that seem to be "erroneous", such as this :
bcftools filter -s "feq" -S "." -e "GT='r' & FORMAT/AD[0]/FORMAT/DP < 1-0.75"
but I think this might be a bit dangerous. Like this, both entries for the fourth sample in the MWE I presented are set to "." and stay "." when merged.
Thanks a lot for your help and thanks a lot for the tool !
I think its a fundamental problem with representing multi allelics as binary. Three alleles A,B,C Gives the following genotypes:
alleles | A | B | C |
---|---|---|---|
A | A/A | A/B | A/C |
B | B/B | B/C | |
C | C/C |
If we try and convert to two bi-allelics using A as the reference we can represent the following genotypes
multiple biallelics | R/R | R/A | A/A |
---|---|---|---|
alt=B | A/A | A/B | B/B |
alt=C | A/A | A/C | C/C |
We lose the non-reference mutual heterozygotes i.e. the B/C genotype class plus the AA genotype gets represented twice.
The * allele code is for long deletions causing a third allele. But what we if we have two distinct long deletions overlapping a SNP site?
Variant graphs will sort this out in the long term.
Short term: Filter them out and hope it's not a signal lost for analysis in GWAS.