bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

Error at chrX:31709584: cannot combine diploid with haploid genotype

Open kenhanscombe opened this issue 2 years ago • 1 comments

We use triple bcftools norm to counter the issue of SNPs and decomposed multi-allelic variants not being combined in a single norm step, resulting in some duplicate variant entires.

bcftools merge -l chunk.list -m both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
| bcftools norm --fasta-ref ${reference_fasta} -m+both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
-Ob -o ${build}_${gene}_${ensemblID}_merged_normalized_1.bcf

This fails for a few genes on chromosome X (e.g. DMD, PCDH19), specifically between the first and second norm steps, in the attempt to join previously split bi-allelic sites back into multi-allelic sites.

For DMD, this works

bcftools merge -l test_merge_list.txt -m both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
-Ob -o test_merge_norm1.bcf

This fails

bcftools merge -l test_merge_list.txt -m both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
| bcftools norm --fasta-ref ${reference_fasta} -m+both \
-Ob -o test_merge_norm2.bcf
Error at chrX:31709584: cannot combine diploid with haploid genotype

We've applied this to 100k samples, but in a small test sample of 4 (M F M F)

chrX 31707060 ./. ./. ./. 0/1 chrX 31707730 ./. ./. 1 0/1 chrX 31708369 ./. ./. ./. 0/1 chrX 31708374 ./. ./. ./. 0/1 chrX 31709555 ./. 0/1 0 0/1 chrX 31709555 ./. 0/0 1 0/0 chrX 31709584 ./. 0/1 ./. 0/0 chrX 31709584 ./. 0/0 ./. 0/1 chrX 31709584 . . 1 . chrX 31710931 . . 1 . chrX 31711087 . . 1 . chrX 31711196 . . 1 .

So, it looks like it fails where bi-allelic records of multi-allelic sites have both diploid and haploid genotypes, after the first bcftools norm -m-both

This passes if you set M and F ploidy to 2 everywhere (between 1st and 2nd norm)

bcftools merge -l test_merge_list.txt -m both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
| bcftools +fixploidy -- -s sample_sex.txt \
| bcftools norm --fasta-ref ${reference_fasta} -m+both \
-Ob -o test_merge_norm2.bcf

But it doesn’t seem right to set male non-PAR regions to ploidy 2?


Tried setting non-PAR regions to ploidy 1

bcftools merge -l test_merge_list.txt -m both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
| bcftools +fixploidy -- -p ploidy_GRCh38.txt -d 2 -s sample_sex.txt \
| bcftools norm --fasta-ref ${reference_fasta} -m+both \
-Ob -o test_merge_norm2.bcf

where the file ploidy_GRCh38.txt includes boundaries taken from PLINK 2

chrX 2781479 155701383 M 1

So in regions not specified in -p input, as far as I understand, -d 2 sets default ploidy. But, that fails with the original error.

kenhanscombe avatar Jul 17 '23 09:07 kenhanscombe

I don't understand this well. The VCF fragment you're showing

chrX 31709584 ./. 0/1 ./. 0/0
chrX 31709584 ./. 0/0 ./. 0/1
chrX 31709584 . . 1 .

is output of this command?

bcftools norm --fasta-ref ref.fa -m-both

Can you show what is the input to this?

My best guess is that the input file already contains duplicate lines, some with haploid and some with diploid genotypes. If that is the case, this should be perhaps fixed upstream.

pd3 avatar Aug 24 '23 11:08 pd3

Hi @pd3 ,

Hope you are well.

I believe I am encountering the same issue.

Please see this minimal example.

It shows a valid multi-sample VCF with a multi-allelic SNP, that is split into its bi-allelic constituents.

##fileformat=VCFv4.1									
##fileDate=20190708	
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1,length=248956422>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE_ONE	SAMPLE_TWO
chr1	100	.	A	T	.	.	.	GT	0/1	./.
chr1	100	.	A	C	.	.	.	GT	.	1

To join bi-allelic sites into multi-allelic records, the command...

bcftools norm -m+both example.vcf.gz

fails with:

Error at chr1:100: cannot combine diploid with haploid genotype

What I would expect from the above merge command is the following:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE_ONE	SAMPLE_TWO
chr1	100	.	A	T,C	.	.	.	GT	0/1	2

Please let me know if you need anymore information.

Many thanks.

chrisodhams avatar May 01 '24 16:05 chrisodhams

Hopefully this should fix it. Although still not sure how one would end up with a mixture of different ploidies.

pd3 avatar May 05 '24 18:05 pd3

Hi @pd3,

Many thanks for looking into this - much appreciated.

I'll explain how I ended up with a mixture of different ploidies.

Considering the two examples, with two VCFs below - performing the following operations:

merge: bcftools merge -m both (both SNP and indel records can be multiallelic)

normalise: bcftools norm -m-both -a (split multiallelic sites into biallelic records (-), both SNPs and indels should be merged separately into two records, decompose complex variants, e.g. split MNVs into consecutive SNVs)

Diploid SNP + haploid SNP

vcf1
chr1	100	.	A	T	.	.	.	GT	0/1

vcf2  
chr1	100	.	A	C	.	.	.	GT	1

merge output
chr1	100	.	A	T,C	.	.	.	GT	0/1	2

normalise output
chr1	100	.	A	T	.	.	.	GT	0/1	0
chr1	100	.	A	C	.	.	.	GT	0/0	1

Diploid SNP + haploid MNP

vcf1
chr1	100	.	A	T	.	.	.	GT	0/1

vcf2
chr1	100	.	AC	TG	.	.	.	GT	1

merge output
chr1	100	.	A	T	.	.	.	GT	0/1	./.
chr1	100	.	AC	TG	.	.	.	GT	.	1

normalise output
chr1	100	.	A	T	.	.	.	GT	0/1	./.
chr1	100	.	A	T	.	.	.	GT	.	1
chr1	101	.	C	G	.	.	.	GT	.	1

You can see the merge output creates the mixture of haploid and diploid ploidies. When trying to join both A/T variant lines into a single line, via bcftools norm -m+both (as explained in the previous post), the cannot combine diploid with haploid genotype occurs.

Hope this makes sense.

chrisodhams avatar May 06 '24 12:05 chrisodhams