consensus `--missing` not compatible with gvcf blocks
bcftools version: 1.21
Doing consensus with flag --missing will throw error on missing (./.) gvcf blocks. Here is a minimal reproducible example:
test.vcf.gz:
##fileformat=VCFv4.2
##contig=<ID=1,length=100>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1
1 3 . A . . . END=5;MIN_DP=1;AN=0 GT ./.
Inline commands to reproduce, for your convenience
printf '##fileformat=VCFv4.2\n##contig=<ID=1,length=100>\n##ALT=<ID=*,Description="Represents allele(s) other than observed.">\n##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">\n##INFO=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">\n##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1\n1\t3\t.\tA\t.\t.\t.\tEND=5;MIN_DP=1;AN=0\tGT\t./.' | bcftools view -Oz -o test.vcf.gz --write-index
Running command
printf ">1\nAAAAAA" | bcftools consensus --missing N test.vcf.gz
will throws error:
The fasta sequence does not match the REF allele at 1:3:
REF .vcf: [A]
ALT .vcf: [N]
REF .fa : [AAA]A
While the expected output is
>1
AANNNA
Or is there any other way to preserve missing sites when performing consensus on gvcf files?
Thank you for the issue, this is now possible. Note, however, that the change will take effect only after https://github.com/samtools/htslib/pull/1879 is merged
samtools/htslib#1879 has now been merged, but it looks like there's an out of bounds read in bcftools consensus that needs to be fixed. See https://cirrus-ci.com/task/5061858287157248
The out of bounds read was caused by stale variant type data. See samtools/htslib#1883 for a fix.
@pd3 Hi, thanks for the feature, it works well in many of my cases.
However, when I apply consensus --missing N on a reference sequence segment that partially covers a missing gvcf block, I got an error:
# using the same test file provided in this issue
printf ">1:2-4\nAAA" | bcftools consensus --missing N test.vcf.gz
Note: applying IUPAC codes based on FORMAT/GT in sample 1
>1:2-4
[E::bcf_get_variant_type] Requested allele outside valid range
While the expected result will be
>1:2-4
ANN
(The missing gvcf block range from position 3 to 5, while the reference segment range to 4, so two Ns only.)
Is it possible to make it?