bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

`bcftools norm -m +any` replaces reference allele GTs with missing for repeated sites

Open rickymagner opened this issue 1 year ago • 1 comments

This is a bit of a niche case, but came up when I was trying to create some consensus VCFs using trios. Here's a minimal example:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG002   HG003
chr1    219783256       .       AAC     ACAC,A  .       .       SYNC=219783257;BASE=FN_CA;CALL=FP_CA;AN=4;AC=1,2        GT      2/1     ./.
chr1    219783256       .       A       ACACACACACACAC  .       .       SYNC=219783257;CALL=FP_CA;AN=2;AC=1     GT      ./.     0/1

Here we have a two-sample VCF, with two sites with the same POS. The key point is the second entry shows HG003 having GT 0/1, wheras the first as missing ./.. The meanings are different: the first signifies some amount of confidence in seeing both reference and alt supporting reads, whereas the latter reflects uncertainty about how those alts relate to the sample. When running bcftools norm -m +any on this, we get the following:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG002   HG003
chr1    219783256       .       AAC     ACAC,A,ACACACACACACACAC .       .       SYNC=219783257;BASE=FN_CA;CALL=FP_CA;AN=4;AC=1,2,1      GT      2/1     ./3

Now the HG003 GT at this one site is ./3 (corresponding to ./1 with the previous order). In other words, the confidence about seeing a reference haplotype here is "lost" in the normalization. I would consider this a bug, since the tool is deciding to throw away confidence about matching the reference and replacing with some uncertainty. This is actually an important distinction, as this example shows that setting HG003 to have GT 0/3 here (matching the original) is a definitive mendelian violation compared with the 2/1 GT in the son, whereas ./3 leaves open the possibility that an MV hasn't occurred. Some tools are sensitive to this, and I'd expect the behavior involving 0/3 if I didn't catch this example. It's also worth noting if you throw away the other sample (make it a single sample VCF), the behavior is as I expected: the site collapses into a 0/1 site.

Can anyone confirm if this is expected behavior (and why), or if this is a bug? This was done with bcftools v1.20.

rickymagner avatar Aug 07 '24 13:08 rickymagner

Hi, I would also like to have more information on this behavior... Thanks

guriaregojo avatar Oct 26 '24 10:10 guriaregojo

Sorry it took longer to take longer. I just tested with the latest version we get

$ bcftools norm -m+any rmme.vcf | grep -v ^#
chr1	219783256	.	AAC	ACAC,A,ACACACACACACACAC	.	.	.	GT	2/1	0/3

Is this a satisfactory result?

pd3 avatar Oct 31 '24 14:10 pd3

This looks great, thanks! I just checked v1.21 and it looks like it has the behavior you report, so something in that release must have fixed this.

rickymagner avatar Oct 31 '24 19:10 rickymagner