bcftools bcftools view -s/-S does not recode genotypes correctly

bcftools view -s/-S does not recode genotypes correctly

Open freeseek opened this issue 3 years ago • 7 comments

By chance, I have noticed the following odd behavior. Here to reproduce the issue:

wget ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
gunzip Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
wget ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/Homo_sapiens.GRCh38.103.gff3.gz
(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=1,length=248956422>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSM1\tSM2"
echo -e "1\t11874\t.\tC\tT\t.\t.\t.\tGT\t0/1\t0/0/1") > in.vcf

Then the following command causes an error:

$ bcftools view --no-version -Ou -s SM1 in.vcf | bcftools csq --no-version -f Homo_sapiens.GRCh38.dna.chromosome.1.fa -g Homo_sapiens.GRCh38.103.gff3.gz -o /dev/null -v 0
bcftools: csq.c:3715: csq_stage: Assertion `ngt<=2' failed.
Aborted (core dumped)

While the following command works fine:

$ bcftools view --no-version -Ov -s SM1 in.vcf | bcftools csq --no-version -f Homo_sapiens.GRCh38.dna.chromosome.1.fa -g Homo_sapiens.GRCh38.103.gff3.gz -o /dev/null -v 0

What seems to be going on is that, even after subsetting the number of samples with bcftools view -s/-S, the genotype array for the genotypes keeps three entries per sample even if all subset samples have two or less entries after subsetting. When recoding the binary VCF structure to text with the -Ov option, the number of entries is lowered to two, making bcftools csq happy again. This seems like a missed performance improvement for bcftools view -s/-S

Apr 21 '21 16:04 freeseek

This is not exactly related, but I have also noticed that when you merge VCFs such as the following:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1,length=248956422>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSM1"
echo -e "chr1\t1\t.\tC\tG\t.\t.\t.\tGT\t0/1") | \
  bcftools view --no-version -Oz -o A.vcf.gz

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1,length=248956422>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSM2") | \
  bcftools view --no-version -Oz -o B.vcf.gz

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1,length=248956422>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSM3"
echo -e "chr1\t1\t.\tC\tG\t.\t.\t.\tGT\t0/1/0"
echo -e "chr1\t2\t.\tA\tT\t.\t.\t.\tGT\t0/0/1") | \
  bcftools view --no-version -Oz -o C.vcf.gz

You can get some odd outcome:

$ bcftools merge --no-version --no-index A.vcf.gz B.vcf.gz C.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248956422>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SM1	SM2	SM3
chr1	1	.	C	G	.	.	.	GT	0/1	././.	0/1/0
chr1	2	.	A	T	.	.	.	GT	././.	././.	0/0/1

While the following is slightly different:

$ bcftools merge --no-version --no-index A.vcf.gz B.vcf.gz | bcftools merge --no-version --no-index - C.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248956422>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SM1	SM2	SM3
chr1	1	.	C	G	.	.	.	GT	0/1	./.	0/1/0
chr1	2	.	A	T	.	.	.	GT	././.	././.	0/0/1

Some genotypes from sites missing from a VCF are set to missing diploid or missing triploid depending on the VCF that is merged. I suppose that this is a compromise as you really don't know what the ploidy of a site not encoded in a VCF is, but it does seem quite odd and makes the merging operation not associative anymore.

I can also see this as becoming an issue when you merge males and females over chromosome X if the two are encoded as haploid and diploid in the nonPAR region. I don't think there is a fix that works all the time for this but I wonder whether the appropriate behavior would be to assume a fixed ploidy with diploid by default and maybe provide an option to explicitly provide the ploidy of missing sites.

If you are wondering, the way I have noticed this is when managing triploid genotypes from the GATK Mutect2 tool.

Apr 21 '21 22:04 freeseek

Thank you for opening these issues.

The first is that sample subsetting in htslib does not reduce the BCF per-sample format size when the bulkiest samples are removed. I think this is a good optimization and should stay. Instead the fix should be done in bcftools csq by adding a support for ploidy>2, see also #973. Doing this properly would be quite difficult and would require large rewrites. However, I can imagine adding a limited support, in the simplest case we could just select at maximum of two alleles from the full range of the alleles in multiploid genotypes.

The second problem is about determining the best default value for absent genotypes. The simplest solution would be to use a single dot to indicate a missing GT, without trying to infer the ploidy. Any other solution is quite complicated and would require two passes through the files to infer the ploidy from other records. Although inferring the sex and setting the ploidy correctly is doable for normal data, this approach would not work for "non-standard" data (tumors, etc) anyway.

Apr 26 '21 08:04 pd3

Actually I had only used bcftools csq to provide an example highlighting the difference in "compressing" the length of the genotype vector in a binary VCF.

I agree with your point about coding ploidy. My personal preference would be for the next VCF revision to take into account ploidy itself as a potentially missing variable and have something beyond . and ./.. This could be appropriate for researchers studying cancer samples with variable ploidy across the genome. I don't think inferring ploidy should be something BCFtools has to handle as conceptually is not something that can be done on a streamed VCF.

Until missing ploidy is handled in a future VCF revision, there are some questions in my mind:

Should VCF and BCF include information about default ploidy (per cohort or per sample)?
Should bcftools merge provide an option to define default ploidy at missing sites?

And, more specifically for the original comment above, is there a way for BCFtools to "re-compress" the genotype vector (and other format vectors) in a binary VCF without having to convert to a text VCF when subsetting samples?

Apr 26 '21 14:04 freeseek

Discussing this with others I propose the following solution:

add a new htslib API function to re-compress FORMAT fields with unused values at the end. This will allow to easily fix programs like csq that refuse to work with polyploid genotypes. (However, you said this was only for demonstration of the problem; what is the real use case?)
make merge output single . for missing genotypes as single dot is the standard way of representing a missing field in VCF. Anything else, like ./., is too specific and prone to errors.

Although the default genotype could be made customizable, I am not sure if the complexity is worth it. For tumor samples it will be useless and in normal human samples, the user would need to provide sample sex and genome build to correctly assign PAR regions on chrX.

What do people think?

Apr 27 '21 14:04 pd3

I discovered this behavior with my own plugin that, like csq, assumes diploid genotypes only. The VCF was generated by merging a GATK HaplotypeCaller VCF and a GATK Mutect2 VCF. I did not know how else to easily reproduce and show the behavior as this disappears as soon as you convert a binary VCF to text.

If . is equivalent to genotype missing, and ./. is equivalent do diploid missing (and ././. is equivalent to triploid missing), then is there no symbol available left for haploid missing?

I would imagine that, when ngt==2, you have three possible missing genotypes in the binary representation:

bcf_int8_vector_end, ...
bcf_int8_missing, bcf_int8_vector_end
bcf_int8_missing, bcf_int8_missing

Are the first two both encoded as .?

Apr 27 '21 15:04 freeseek

VCF format is not able to distinguish between a missing haploid genotype and a missing genotype with unknown ploidy. In BCF this is possible to distinguish between the first and the second case by setting the type descriptor size bits to 0 (MISSING), see the section 6.3.3 of the specification on vectors. However, since we require interoperability of VCF and BCF, this cannot be used to distinguish these cases. (Unless the specification is changed which I don't think is going to happen.)

I have not tested how htslib handles these cases.

Apr 27 '21 15:04 pd3

Creating the following toy VCF example:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1,length=248956422>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSM1\tSM2\tSM3"
echo -e "chr1\t1\t.\tC\tG\t.\t.\t.\tGT\t.\t./.\t././.") | \
  bcftools view --no-version -Ou -o A.bcf

You can then test it contains the following genotypes:

$ bcftools view -H A.bcf | cut -f10-
.	./.	././.

And they are encoded as follows:

$ tail -c9 A.bcf | od -An -tx1 -w3
 00 81 81
 00 00 81
 00 00 00

Where hexadecimal code 81 stands for "end of vector".

If you replace the 00 81 81 genotype with 81 81 81, then you get the same genotypes:

$ (head -c-9 A.bcf; echo -en "\x81\x81\x81\x00\x00\x81\x00\x00\x00") | bcftools view -H | cut -f10-
.	./.	././.

And genotype 81 81 81 gets re-encoded as 00 81 81:

$ (head -c-9 A.bcf; echo -en "\x81\x81\x81\x00\x00\x81\x00\x00\x00") | \
  bcftools view --no-version | bcftools view --no-version -Ou | tail -c9 | od -An -tx1 -w3
 00 81 81
 00 00 81
 00 00 00

Apr 27 '21 18:04 freeseek

bcftools bcftools copied to clipboard

bcftools view -s/-S does not recode genotypes correctly

bcftools
bcftools copied to clipboard