bcftools
bcftools copied to clipboard
feature request: convert MNPs to SNPs
GATK HaplotypeCaller and samtools call these two SNPs at positions i and i+1 (i=763837) for sample NA12878: 20 763837 rs6085879 C T 0/1 20 763838 rs6085880 A G 1/1
In NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz this MNP is present: 20 763837 . CA CG,TG 1|2
vcflib vcfallelicprimitives splits the MNP into two SNPs: 20 763837 . C T 0|1 20 763838 . A G 1|1
It is the output of vcfallelicprimitives, which I desire. I am however not using vcfallelicprimitives, because it changes some of my GT fields from unknown ./. to known despite using the --keep-geno option and it fails to split triallelic SNPs such as 20:678053:G:A,T:1|2.
bcftools norm --multiallelics -both splits the NIST MNP into two MNPs: 20 763837 . CA CG 1|0 20 763837 . CA TG 0|1
vcflib vcfbreakmulti does the same: 20 763837 . CA CG 1|. 20 763837 . CA TG .|1
vt decompose does the same and prints the GT field after a git push today: 20 763837 . CA CG 1/. 20 763837 . CA TG ./1
GATK LeftAlignAndTrimVariants --splitMultiallelics also doesn't print the GT field: 20 763837 . CA TG .|. 20 763838 . A G .|.
GATK VariantsToAllelicPrimitives does not handle multiallelic variants and leaves the MNP untouched: 20 763837 . CA CG,TG 1|2
vt normalize does not seem to handle multiallelic variants (piping the vt decompose output to it also doesn't work): 20 763837 . CA CG,TG 1|2
Do you plan to expand the functionality of bcftools to allow the user to split the NIST MNP into two SNPs (similar to vcfallelicprimitives)?
I ended up writing my own code. Would it be of any help to you, if I share my Python code for doing this? Here is my quick and dirty Python script, which I wrote in a hurry, but I prefer your C code, which is faster, cleaner, more elegant and part of a larger package: https://github.com/team149/tc9/blob/master/WGS/vcf_atomize.py
This would be a useful feature.
Me too, I would use it to fix output from freebayes, see https://github.com/ekg/freebayes/issues/19
I still fantasise about this feature in my dreams at night...
I'd like this feature, as it is easy for me to analysis the variant and integrate into other pipelines.
vt_decompose_blocksub
can do this (I agree it would be an excellent addition to bcftools norm
)
$ bcftools view -H test.vcf.gz
chr20 399074 rs386811500 ACA GCG 88 PASS SNVSB=-13.8;SNVHPOL=3 GT:GQ:GQX:DP:DPF:AD:PL 0/1:121:32:20:5:11,9:123,0,199
$ vt decompose_blocksub test.vcf.gz 2> /dev/null | bcftools view -H
chr20 399074 rs386811500 A G 88 PASS SNVSB=-13.8;SNVHPOL=3;OLD_CLUMPED=chr20:399074:ACA/GCG GT:GQ:GQX:DP:DPF:AD:PL 0/1:121:32:20:5:11,9:123,0,199
chr20 399076 rs386811500 A G 88 PASS SNVSB=-13.8;SNVHPOL=3;OLD_CLUMPED=chr20:399074:ACA/GCG GT:GQ:GQX:DP:DPF:AD:PL 0/1:121:32:20:5:11,9:123,0,199
http://genome.sph.umich.edu/wiki/Vt
I've just bumped into this issue too, and thank you for the suggestion to use vt decompose_blocksub.
However I think this is also actually a bug in bcftools isec. For example when analysing this freebayes vcf file using bcftools isec it claims one variant is true and one is false.
$ zegrep $'\t'481307[0-9]$'\t' 000?.vcf
0000.vcf:1 4813074 . C T 20 . . GT:AD 0/1:20,20
0002.vcf:1 4813071 . T C 20 . . GT:AD 0/1:20,20
0003.vcf:1 4813071 . TGAC CGAT 612.577 . AB=0.45098;...
I did something akin to bcftools isec truth.vcf call.vcf. For reminder, 0000.vcf is for variants only in truth.vcf (so false negatives), 0001.vcf is variants only in call.vcf (false positive), 0002.vcf and 0003.vcf are for variants in both, but with the line taken from truth.vcf and call.vcf respectively.
Hence we see the truth file has 71 T C and 74 C T, while freebayes called 71 TGAC CGAT. Clearly freebayes is the better answer, because it contains both variants and also contains phasing between them. Isec correctly claimed that 71 T C matches 71 TGAC CGAT, but thought the 74 one was absent.
vt will correctly resolve this, permitting isec to work correctly, however I view it as a bug that isec is giving the wrong answer here. If not a bug, then at least something that needs clear documentation and a pointer to vt to patch the data up first.
Note that vt doesn't always give the desired output, either (see atks/vt#26). The example VCF is here. In terms of accuracy, bgt may be better, but it discards all INFO/FORMAT, which makes it not so useful in practice. On this particular example:
]$ ./bgt atomize -S ex2.vcf
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=11,length=135006516>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3 S4
11 101 . G C 0 . . GT 0/. ./0 0/0 0/1
11 101 . GCGT G 0 . . GT 0/1 1/. ./. ./.
11 102 . C T 0 . . GT 0/. ./0 0/1 0/0
11 104 . T A 0 . . GT 0/. ./1 1/1 1/0
or with -M
(well, nowadays it should use the *
allele instead):
$ ./bgt atomize -SM ex2.vcf
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=11,length=135006516>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3 S4
11 101 . G C,<M> 0 . . GT 0/2 2/0 0/0 0/1
11 101 . GCGT G,<M> 0 . . GT 0/1 1/2 2/2 2/2
11 102 . C T,<M> 0 . . GT 0/2 2/0 0/1 0/0
11 104 . T A,<M> 0 . . GT 0/2 2/1 1/1 1/0
Thanks Heng, although I think my current strategy of bcftools norm -m -both
piped into vt resolves this. The first command will split up the multi-allelic sites and then vt will split the mnps up.
As far as bcftools is concerned, this is why I'm adding this here - bcftools norm could add a mode that also does the mnp splitting at the same time.
@jkbonfield This bug is probably related to https://github.com/samtools/bcftools/issues/665?
MNPs and other complex events can be now decomposed using the new norm -a
option.
Note that although the code correctly transfers annotations into split records respecting the various Number=R,G,A tags, the following simplification has been made for now: currently no attempt is made to reconcile VCF annotations of identical atoms originating from different ALT alleles. This could be improved and merge logic provided, similarly to merge -l
. For example, the allelic depths (AD) should be summed when an identical atomized output allele is encountered. This level of complexity is not addressed in this initial draft. Higher priority for now is to provide the inverse "join" operation.