bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

feature request: convert MNPs to SNPs

Open tommycarstensen opened this issue 9 years ago • 10 comments

GATK HaplotypeCaller and samtools call these two SNPs at positions i and i+1 (i=763837) for sample NA12878: 20 763837 rs6085879 C T 0/1 20 763838 rs6085880 A G 1/1

In NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz this MNP is present: 20 763837 . CA CG,TG 1|2

vcflib vcfallelicprimitives splits the MNP into two SNPs: 20 763837 . C T 0|1 20 763838 . A G 1|1

It is the output of vcfallelicprimitives, which I desire. I am however not using vcfallelicprimitives, because it changes some of my GT fields from unknown ./. to known despite using the --keep-geno option and it fails to split triallelic SNPs such as 20:678053:G:A,T:1|2.

bcftools norm --multiallelics -both splits the NIST MNP into two MNPs: 20 763837 . CA CG 1|0 20 763837 . CA TG 0|1

vcflib vcfbreakmulti does the same: 20 763837 . CA CG 1|. 20 763837 . CA TG .|1

vt decompose does the same and prints the GT field after a git push today: 20 763837 . CA CG 1/. 20 763837 . CA TG ./1

GATK LeftAlignAndTrimVariants --splitMultiallelics also doesn't print the GT field: 20 763837 . CA TG .|. 20 763838 . A G .|.

GATK VariantsToAllelicPrimitives does not handle multiallelic variants and leaves the MNP untouched: 20 763837 . CA CG,TG 1|2

vt normalize does not seem to handle multiallelic variants (piping the vt decompose output to it also doesn't work): 20 763837 . CA CG,TG 1|2

Do you plan to expand the functionality of bcftools to allow the user to split the NIST MNP into two SNPs (similar to vcfallelicprimitives)?

I ended up writing my own code. Would it be of any help to you, if I share my Python code for doing this? Here is my quick and dirty Python script, which I wrote in a hurry, but I prefer your C code, which is faster, cleaner, more elegant and part of a larger package: https://github.com/team149/tc9/blob/master/WGS/vcf_atomize.py

tommycarstensen avatar Oct 14 '14 00:10 tommycarstensen

This would be a useful feature.

noporpoise avatar Dec 15 '15 22:12 noporpoise

Me too, I would use it to fix output from freebayes, see https://github.com/ekg/freebayes/issues/19

mmokrejs avatar Dec 22 '16 17:12 mmokrejs

I still fantasise about this feature in my dreams at night...

tommycarstensen avatar Mar 07 '17 17:03 tommycarstensen

I'd like this feature, as it is easy for me to analysis the variant and integrate into other pipelines.

ProfessionalFarmer avatar Mar 08 '17 01:03 ProfessionalFarmer

vt_decompose_blocksub can do this (I agree it would be an excellent addition to bcftools norm)

$ bcftools view -H test.vcf.gz 
chr20	399074	rs386811500	ACA	GCG	88	PASS	SNVSB=-13.8;SNVHPOL=3	GT:GQ:GQX:DP:DPF:AD:PL	0/1:121:32:20:5:11,9:123,0,199

$ vt decompose_blocksub test.vcf.gz 2> /dev/null  | bcftools view -H
chr20	399074	rs386811500	A	G	88	PASS	SNVSB=-13.8;SNVHPOL=3;OLD_CLUMPED=chr20:399074:ACA/GCG	GT:GQ:GQX:DP:DPF:AD:PL	0/1:121:32:20:5:11,9:123,0,199
chr20	399076	rs386811500	A	G	88	PASS	SNVSB=-13.8;SNVHPOL=3;OLD_CLUMPED=chr20:399074:ACA/GCG	GT:GQ:GQX:DP:DPF:AD:PL	0/1:121:32:20:5:11,9:123,0,199

http://genome.sph.umich.edu/wiki/Vt

jaredo avatar Mar 08 '17 09:03 jaredo

I've just bumped into this issue too, and thank you for the suggestion to use vt decompose_blocksub.

However I think this is also actually a bug in bcftools isec. For example when analysing this freebayes vcf file using bcftools isec it claims one variant is true and one is false.

$ zegrep $'\t'481307[0-9]$'\t' 000?.vcf
0000.vcf:1	4813074	.	C	T	20	.	.	GT:AD	0/1:20,20
0002.vcf:1	4813071	.	T	C	20	.	.	GT:AD	0/1:20,20
0003.vcf:1	4813071	.	TGAC	CGAT	612.577	.	AB=0.45098;...

I did something akin to bcftools isec truth.vcf call.vcf. For reminder, 0000.vcf is for variants only in truth.vcf (so false negatives), 0001.vcf is variants only in call.vcf (false positive), 0002.vcf and 0003.vcf are for variants in both, but with the line taken from truth.vcf and call.vcf respectively.

Hence we see the truth file has 71 T C and 74 C T, while freebayes called 71 TGAC CGAT. Clearly freebayes is the better answer, because it contains both variants and also contains phasing between them. Isec correctly claimed that 71 T C matches 71 TGAC CGAT, but thought the 74 one was absent.

vt will correctly resolve this, permitting isec to work correctly, however I view it as a bug that isec is giving the wrong answer here. If not a bug, then at least something that needs clear documentation and a pointer to vt to patch the data up first.

jkbonfield avatar Mar 01 '18 15:03 jkbonfield

Note that vt doesn't always give the desired output, either (see atks/vt#26). The example VCF is here. In terms of accuracy, bgt may be better, but it discards all INFO/FORMAT, which makes it not so useful in practice. On this particular example:

]$ ./bgt atomize -S ex2.vcf
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=11,length=135006516>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S1      S2      S3      S4
11      101     .       G       C       0       .       .       GT      0/.     ./0     0/0     0/1
11      101     .       GCGT    G       0       .       .       GT      0/1     1/.     ./.     ./.
11      102     .       C       T       0       .       .       GT      0/.     ./0     0/1     0/0
11      104     .       T       A       0       .       .       GT      0/.     ./1     1/1     1/0

or with -M (well, nowadays it should use the * allele instead):

$ ./bgt atomize -SM ex2.vcf
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=11,length=135006516>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S1      S2      S3      S4
11      101     .       G       C,<M>   0       .       .       GT      0/2     2/0     0/0     0/1
11      101     .       GCGT    G,<M>   0       .       .       GT      0/1     1/2     2/2     2/2
11      102     .       C       T,<M>   0       .       .       GT      0/2     2/0     0/1     0/0
11      104     .       T       A,<M>   0       .       .       GT      0/2     2/1     1/1     1/0

lh3 avatar Mar 01 '18 15:03 lh3

Thanks Heng, although I think my current strategy of bcftools norm -m -both piped into vt resolves this. The first command will split up the multi-allelic sites and then vt will split the mnps up.

As far as bcftools is concerned, this is why I'm adding this here - bcftools norm could add a mode that also does the mnp splitting at the same time.

jkbonfield avatar Mar 01 '18 17:03 jkbonfield

@jkbonfield This bug is probably related to https://github.com/samtools/bcftools/issues/665?

pd3 avatar Mar 02 '18 12:03 pd3

MNPs and other complex events can be now decomposed using the new norm -a option.

Note that although the code correctly transfers annotations into split records respecting the various Number=R,G,A tags, the following simplification has been made for now: currently no attempt is made to reconcile VCF annotations of identical atoms originating from different ALT alleles. This could be improved and merge logic provided, similarly to merge -l. For example, the allelic depths (AD) should be summed when an identical atomized output allele is encountered. This level of complexity is not addressed in this initial draft. Higher priority for now is to provide the inverse "join" operation.

pd3 avatar Feb 25 '21 11:02 pd3