bcftools
bcftools copied to clipboard
bcftools view -s/-S does not recode genotypes correctly
By chance, I have noticed the following odd behavior. Here to reproduce the issue:
wget ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
gunzip Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
wget ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/Homo_sapiens.GRCh38.103.gff3.gz
(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=1,length=248956422>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSM1\tSM2"
echo -e "1\t11874\t.\tC\tT\t.\t.\t.\tGT\t0/1\t0/0/1") > in.vcf
Then the following command causes an error:
$ bcftools view --no-version -Ou -s SM1 in.vcf | bcftools csq --no-version -f Homo_sapiens.GRCh38.dna.chromosome.1.fa -g Homo_sapiens.GRCh38.103.gff3.gz -o /dev/null -v 0
bcftools: csq.c:3715: csq_stage: Assertion `ngt<=2' failed.
Aborted (core dumped)
While the following command works fine:
$ bcftools view --no-version -Ov -s SM1 in.vcf | bcftools csq --no-version -f Homo_sapiens.GRCh38.dna.chromosome.1.fa -g Homo_sapiens.GRCh38.103.gff3.gz -o /dev/null -v 0
What seems to be going on is that, even after subsetting the number of samples with bcftools view -s/-S
, the genotype array for the genotypes keeps three entries per sample even if all subset samples have two or less entries after subsetting. When recoding the binary VCF structure to text with the -Ov
option, the number of entries is lowered to two, making bcftools csq
happy again. This seems like a missed performance improvement for bcftools view -s/-S
This is not exactly related, but I have also noticed that when you merge VCFs such as the following:
(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1,length=248956422>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSM1"
echo -e "chr1\t1\t.\tC\tG\t.\t.\t.\tGT\t0/1") | \
bcftools view --no-version -Oz -o A.vcf.gz
(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1,length=248956422>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSM2") | \
bcftools view --no-version -Oz -o B.vcf.gz
(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1,length=248956422>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSM3"
echo -e "chr1\t1\t.\tC\tG\t.\t.\t.\tGT\t0/1/0"
echo -e "chr1\t2\t.\tA\tT\t.\t.\t.\tGT\t0/0/1") | \
bcftools view --no-version -Oz -o C.vcf.gz
You can get some odd outcome:
$ bcftools merge --no-version --no-index A.vcf.gz B.vcf.gz C.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248956422>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SM1 SM2 SM3
chr1 1 . C G . . . GT 0/1 ././. 0/1/0
chr1 2 . A T . . . GT ././. ././. 0/0/1
While the following is slightly different:
$ bcftools merge --no-version --no-index A.vcf.gz B.vcf.gz | bcftools merge --no-version --no-index - C.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248956422>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SM1 SM2 SM3
chr1 1 . C G . . . GT 0/1 ./. 0/1/0
chr1 2 . A T . . . GT ././. ././. 0/0/1
Some genotypes from sites missing from a VCF are set to missing diploid or missing triploid depending on the VCF that is merged. I suppose that this is a compromise as you really don't know what the ploidy of a site not encoded in a VCF is, but it does seem quite odd and makes the merging operation not associative anymore.
I can also see this as becoming an issue when you merge males and females over chromosome X if the two are encoded as haploid and diploid in the nonPAR region. I don't think there is a fix that works all the time for this but I wonder whether the appropriate behavior would be to assume a fixed ploidy with diploid by default and maybe provide an option to explicitly provide the ploidy of missing sites.
If you are wondering, the way I have noticed this is when managing triploid genotypes from the GATK Mutect2 tool.
Thank you for opening these issues.
The first is that sample subsetting in htslib does not reduce the BCF per-sample format size when the bulkiest samples are removed. I think this is a good optimization and should stay. Instead the fix should be done in bcftools csq
by adding a support for ploidy>2, see also #973. Doing this properly would be quite difficult and would require large rewrites. However, I can imagine adding a limited support, in the simplest case we could just select at maximum of two alleles from the full range of the alleles in multiploid genotypes.
The second problem is about determining the best default value for absent genotypes. The simplest solution would be to use a single dot to indicate a missing GT, without trying to infer the ploidy. Any other solution is quite complicated and would require two passes through the files to infer the ploidy from other records. Although inferring the sex and setting the ploidy correctly is doable for normal data, this approach would not work for "non-standard" data (tumors, etc) anyway.
Actually I had only used bcftools csq
to provide an example highlighting the difference in "compressing" the length of the genotype vector in a binary VCF.
I agree with your point about coding ploidy. My personal preference would be for the next VCF revision to take into account ploidy itself as a potentially missing variable and have something beyond .
and ./.
. This could be appropriate for researchers studying cancer samples with variable ploidy across the genome. I don't think inferring ploidy should be something BCFtools has to handle as conceptually is not something that can be done on a streamed VCF.
Until missing ploidy is handled in a future VCF revision, there are some questions in my mind:
- Should VCF and BCF include information about default ploidy (per cohort or per sample)?
- Should
bcftools merge
provide an option to define default ploidy at missing sites?
And, more specifically for the original comment above, is there a way for BCFtools to "re-compress" the genotype vector (and other format vectors) in a binary VCF without having to convert to a text VCF when subsetting samples?
Discussing this with others I propose the following solution:
- add a new htslib API function to re-compress FORMAT fields with unused values at the end. This will allow to easily fix programs like
csq
that refuse to work with polyploid genotypes. (However, you said this was only for demonstration of the problem; what is the real use case?) - make
merge
output single.
for missing genotypes as single dot is the standard way of representing a missing field in VCF. Anything else, like./.
, is too specific and prone to errors.
Although the default genotype could be made customizable, I am not sure if the complexity is worth it. For tumor samples it will be useless and in normal human samples, the user would need to provide sample sex and genome build to correctly assign PAR regions on chrX.
What do people think?
I discovered this behavior with my own plugin that, like csq, assumes diploid genotypes only. The VCF was generated by merging a GATK HaplotypeCaller VCF and a GATK Mutect2 VCF. I did not know how else to easily reproduce and show the behavior as this disappears as soon as you convert a binary VCF to text.
If .
is equivalent to genotype missing, and ./.
is equivalent do diploid missing (and ././.
is equivalent to triploid missing), then is there no symbol available left for haploid missing?
I would imagine that, when ngt==2
, you have three possible missing genotypes in the binary representation:
bcf_int8_vector_end, ...
bcf_int8_missing, bcf_int8_vector_end
bcf_int8_missing, bcf_int8_missing
Are the first two both encoded as .
?
VCF format is not able to distinguish between a missing haploid genotype and a missing genotype with unknown ploidy. In BCF this is possible to distinguish between the first and the second case by setting the type descriptor size bits to 0 (MISSING), see the section 6.3.3 of the specification on vectors. However, since we require interoperability of VCF and BCF, this cannot be used to distinguish these cases. (Unless the specification is changed which I don't think is going to happen.)
I have not tested how htslib handles these cases.
Creating the following toy VCF example:
(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1,length=248956422>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSM1\tSM2\tSM3"
echo -e "chr1\t1\t.\tC\tG\t.\t.\t.\tGT\t.\t./.\t././.") | \
bcftools view --no-version -Ou -o A.bcf
You can then test it contains the following genotypes:
$ bcftools view -H A.bcf | cut -f10-
. ./. ././.
And they are encoded as follows:
$ tail -c9 A.bcf | od -An -tx1 -w3
00 81 81
00 00 81
00 00 00
Where hexadecimal code 81
stands for "end of vector".
If you replace the 00 81 81
genotype with 81 81 81
, then you get the same genotypes:
$ (head -c-9 A.bcf; echo -en "\x81\x81\x81\x00\x00\x81\x00\x00\x00") | bcftools view -H | cut -f10-
. ./. ././.
And genotype 81 81 81
gets re-encoded as 00 81 81
:
$ (head -c-9 A.bcf; echo -en "\x81\x81\x81\x00\x00\x81\x00\x00\x00") | \
bcftools view --no-version | bcftools view --no-version -Ou | tail -c9 | od -An -tx1 -w3
00 81 81
00 00 81
00 00 00