bcftools annotate does not overwrite existing entries in header with -h flag
Hi, this issue concerns bcftools annotate and its optional header -h input. I'll demonstrate with an example below (all run with v1.18).
Suppose you have the following test.vcf:
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10000000>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
1 10000 . GA G 30 PASS DP=14 GT 0|0 1|0 1/1
1 20000 . T TA 30 PASS DP=11 GT 0|0 0|1 0/0
I want to update the DP field using the following annotations.tsv:
1 10000 15
1 20000 20
Then I run
bgzip annotations.tsv
tabix -s1 -b2 -e2 annotations.tsv.gz
to index the annotations. But suppose I also want to update the DP header line to have Number=A instead of Number=1. So I make annotations.hdr to be:
##INFO=<ID=DP,Number=A,Type=Integer,Description="Total Depth">
Finally, I run
bcftools annotate -a annotations.tsv.gz -h annotations.hdr -c CHROM,POS,INFO/DP -o "output.vcf.gz" test.vcf
which has the following output:
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10000000>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_annotateVersion=1.18-10-gbe695393+htslib-1.17-48-gf4a3b99
##bcftools_annotateCommand=annotate -a annotations.tsv.gz -h annotations.hdr -c CHROM,POS,INFO/DP -o output.vcf.gz test.vcf; Date=Thu Sep 14 18:12:41 2023
##bcftools_viewVersion=1.18-10-gbe695393+htslib-1.17-48-gf4a3b99
##bcftools_viewCommand=view output.vcf.gz; Date=Thu Sep 14 18:17:49 2023
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
1 10000 . GA G 30 PASS DP=15 GT 0|0 1|0 1/1
1 20000 . T TA 30 PASS DP=20 GT 0|0 0|1 0/0
So the DP fields update accordingly, but the header field is still the same as the old.
Normally, if the header field was missing, it would be added using the -h flag. Is it possible to add functionality so that the -h will also allow you to manually overwrite the existing header lines for the appropriate INFO field changed?
In general, I think bcftools reheader only allows you to update the sequence dictionary using a template. It would be nice if bcftools had a better way to manually update other header lines like in the above use case, especially if the two instances across these tools are related.
Hi, in principle the request is sound, however for this specific case I see two problems.
First, DP is a tag reserved by the VCF specification as Number=1, so it should not be redefined as Number=A. However, that could be easily solved by using a different tag. (Note that there is already the AD tag defined as Number=R, it might be best to use that.)
Second, your tsv carries no information about alleles, only the coordinate. How do you make sure that the correct number is attached to the correct allele?
Lastly, using reheader should be straightforward, why not to use the workflow as suggested by the usage page example?
# Write out the header to be modified
bcftools view -h old.bcf > header.txt
# Edit the header using your favorite text editor
vi header.txt
# Reheader the file
bcftools reheader -h header.txt -o new.bcf old.bcf
Hi, yes those are good points about the DP tag specifically, but was only using it for demonstrative purposes here to make a simple example. And yes, my annotate command was poorly defined, so could be ambiguous in general. (In my use case I also check matching REF/ALT/ID.)
I think my mistake was that I thought I had tried bcftools reheader in the past, and it only updated the sequence dictionary but not other fields. Maybe that was the case in a previous version, or a different tool. In any case, do you think it's still reasonable in most use cases to expect the -h flag in bcftools annotate to operate similarly to bcftools reheader -h?