Error at chrX:31709584: cannot combine diploid with haploid genotype
We use triple bcftools norm to counter the issue of SNPs and decomposed multi-allelic variants not being combined in a single norm step, resulting in some duplicate variant entires.
bcftools merge -l chunk.list -m both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
| bcftools norm --fasta-ref ${reference_fasta} -m+both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
-Ob -o ${build}_${gene}_${ensemblID}_merged_normalized_1.bcf
This fails for a few genes on chromosome X (e.g. DMD, PCDH19), specifically between the first and second norm steps, in the attempt to join previously split bi-allelic sites back into multi-allelic sites.
For DMD, this works
bcftools merge -l test_merge_list.txt -m both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
-Ob -o test_merge_norm1.bcf
This fails
bcftools merge -l test_merge_list.txt -m both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
| bcftools norm --fasta-ref ${reference_fasta} -m+both \
-Ob -o test_merge_norm2.bcf
Error at chrX:31709584: cannot combine diploid with haploid genotype
We've applied this to 100k samples, but in a small test sample of 4 (M F M F)
chrX 31707060 ./. ./. ./. 0/1 chrX 31707730 ./. ./. 1 0/1 chrX 31708369 ./. ./. ./. 0/1 chrX 31708374 ./. ./. ./. 0/1 chrX 31709555 ./. 0/1 0 0/1 chrX 31709555 ./. 0/0 1 0/0 chrX 31709584 ./. 0/1 ./. 0/0 chrX 31709584 ./. 0/0 ./. 0/1 chrX 31709584 . . 1 . chrX 31710931 . . 1 . chrX 31711087 . . 1 . chrX 31711196 . . 1 .
So, it looks like it fails where bi-allelic records of multi-allelic sites have both diploid and haploid genotypes, after the first bcftools norm -m-both
This passes if you set M and F ploidy to 2 everywhere (between 1st and 2nd norm)
bcftools merge -l test_merge_list.txt -m both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
| bcftools +fixploidy -- -s sample_sex.txt \
| bcftools norm --fasta-ref ${reference_fasta} -m+both \
-Ob -o test_merge_norm2.bcf
But it doesn’t seem right to set male non-PAR regions to ploidy 2?
Tried setting non-PAR regions to ploidy 1
bcftools merge -l test_merge_list.txt -m both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
| bcftools +fixploidy -- -p ploidy_GRCh38.txt -d 2 -s sample_sex.txt \
| bcftools norm --fasta-ref ${reference_fasta} -m+both \
-Ob -o test_merge_norm2.bcf
where the file ploidy_GRCh38.txt includes boundaries taken from PLINK 2
chrX 2781479 155701383 M 1
So in regions not specified in -p input, as far as I understand, -d 2 sets default ploidy. But, that fails with the original error.
I don't understand this well. The VCF fragment you're showing
chrX 31709584 ./. 0/1 ./. 0/0
chrX 31709584 ./. 0/0 ./. 0/1
chrX 31709584 . . 1 .
is output of this command?
bcftools norm --fasta-ref ref.fa -m-both
Can you show what is the input to this?
My best guess is that the input file already contains duplicate lines, some with haploid and some with diploid genotypes. If that is the case, this should be perhaps fixed upstream.
Hi @pd3 ,
Hope you are well.
I believe I am encountering the same issue.
Please see this minimal example.
It shows a valid multi-sample VCF with a multi-allelic SNP, that is split into its bi-allelic constituents.
##fileformat=VCFv4.1
##fileDate=20190708
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1,length=248956422>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE_ONE SAMPLE_TWO
chr1 100 . A T . . . GT 0/1 ./.
chr1 100 . A C . . . GT . 1
To join bi-allelic sites into multi-allelic records, the command...
bcftools norm -m+both example.vcf.gz
fails with:
Error at chr1:100: cannot combine diploid with haploid genotype
What I would expect from the above merge command is the following:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE_ONE SAMPLE_TWO
chr1 100 . A T,C . . . GT 0/1 2
Please let me know if you need anymore information.
Many thanks.
Hopefully this should fix it. Although still not sure how one would end up with a mixture of different ploidies.
Hi @pd3,
Many thanks for looking into this - much appreciated.
I'll explain how I ended up with a mixture of different ploidies.
Considering the two examples, with two VCFs below - performing the following operations:
merge: bcftools merge -m both (both SNP and indel records can be multiallelic)
normalise: bcftools norm -m-both -a (split multiallelic sites into biallelic records (-), both SNPs and indels should be merged separately into two records, decompose complex variants, e.g. split MNVs into consecutive SNVs)
Diploid SNP + haploid SNP
vcf1
chr1 100 . A T . . . GT 0/1
vcf2
chr1 100 . A C . . . GT 1
merge output
chr1 100 . A T,C . . . GT 0/1 2
normalise output
chr1 100 . A T . . . GT 0/1 0
chr1 100 . A C . . . GT 0/0 1
Diploid SNP + haploid MNP
vcf1
chr1 100 . A T . . . GT 0/1
vcf2
chr1 100 . AC TG . . . GT 1
merge output
chr1 100 . A T . . . GT 0/1 ./.
chr1 100 . AC TG . . . GT . 1
normalise output
chr1 100 . A T . . . GT 0/1 ./.
chr1 100 . A T . . . GT . 1
chr1 101 . C G . . . GT . 1
You can see the merge output creates the mixture of haploid and diploid ploidies. When trying to join both A/T variant lines into a single line, via bcftools norm -m+both (as explained in the previous post), the cannot combine diploid with haploid genotype occurs.
Hope this makes sense.