bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

consensus `--missing` not compatible with gvcf blocks

Open Fan-iX opened this issue 1 year ago • 4 comments

bcftools version: 1.21 Doing consensus with flag --missing will throw error on missing (./.) gvcf blocks. Here is a minimal reproducible example:

test.vcf.gz:

##fileformat=VCFv4.2
##contig=<ID=1,length=100>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1
1       3       .       A       .       .       .       END=5;MIN_DP=1;AN=0     GT      ./.
Inline commands to reproduce, for your convenience
printf '##fileformat=VCFv4.2\n##contig=<ID=1,length=100>\n##ALT=<ID=*,Description="Represents allele(s) other than observed.">\n##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">\n##INFO=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">\n##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1\n1\t3\t.\tA\t.\t.\t.\tEND=5;MIN_DP=1;AN=0\tGT\t./.' | bcftools view -Oz -o test.vcf.gz --write-index

Running command

printf ">1\nAAAAAA" | bcftools consensus --missing N test.vcf.gz

will throws error:

The fasta sequence does not match the REF allele at 1:3:
   REF .vcf: [A]
   ALT .vcf: [N]
   REF .fa : [AAA]A

While the expected output is

>1
AANNNA

Or is there any other way to preserve missing sites when performing consensus on gvcf files?

Fan-iX avatar Jan 07 '25 09:01 Fan-iX

Thank you for the issue, this is now possible. Note, however, that the change will take effect only after https://github.com/samtools/htslib/pull/1879 is merged

pd3 avatar Jan 25 '25 11:01 pd3

samtools/htslib#1879 has now been merged, but it looks like there's an out of bounds read in bcftools consensus that needs to be fixed. See https://cirrus-ci.com/task/5061858287157248

daviesrob avatar Jan 27 '25 16:01 daviesrob

The out of bounds read was caused by stale variant type data. See samtools/htslib#1883 for a fix.

daviesrob avatar Jan 28 '25 11:01 daviesrob

@pd3 Hi, thanks for the feature, it works well in many of my cases.

However, when I apply consensus --missing N on a reference sequence segment that partially covers a missing gvcf block, I got an error:

# using the same test file provided in this issue
printf ">1:2-4\nAAA" | bcftools consensus --missing N test.vcf.gz
Note: applying IUPAC codes based on FORMAT/GT in sample 1
>1:2-4
[E::bcf_get_variant_type] Requested allele outside valid range

While the expected result will be

>1:2-4
ANN

(The missing gvcf block range from position 3 to 5, while the reference segment range to 4, so two Ns only.)

Is it possible to make it?

Fan-iX avatar Mar 20 '25 05:03 Fan-iX