Add argument to iterate only through the input file for bcftools annotate
I suggest adding an argument to bcftools annotate that iterates only through the input file and looks up each variant in the annotation file. This would be advantageous when the input file is much smaller than the annotation file and most variants are expected to be annotated. A related performance issue is outlined in #1199. As a side effect, it should also allow using bcftools annotate without indexing the input file first, which enables the use of piped output from other bcftools commands and allows further speedups in some workflows.
On the code side, this could be implemented with an option such as --iterate-only-input. When it is set, instead of the synced bcf_sr_next_line, the standard BCF reader can be used and each variant found can be looked up in the (indexed) annotation file. The downside is a potential performance regression if the annotation file is smaller than the input VCF, but since the user has to select this option manually, that should not pose any problems.
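To make the idea concrete, here is a minimal sketch of the lookup strategy. It is written in Python rather than C and does not use htslib. The sorted in-memory list is a hypothetical stand-in for the indexed annotation file, and the names ANNOTATIONS, lookup and annotate are invented for illustration only. A real implementation would presumably query the annotation file through its index instead.

```python
from bisect import bisect_left

# Hypothetical stand-in for an indexed annotation file: records sorted by
# (chrom, pos), each carrying the ID annotation to transfer.
ANNOTATIONS = [
    ("chr1", 100, "A", "T", "rs1"),
    ("chr1", 250, "G", "C", "rs2"),
    ("chr1", 250, "G", "T", "rs3"),
    ("chr2", 40, "C", "A", "rs4"),
]
KEYS = [(chrom, pos) for chrom, pos, *_ in ANNOTATIONS]


def lookup(chrom, pos, ref, alt):
    """Return the annotation ID for an exact site and allele match, else None."""
    i = bisect_left(KEYS, (chrom, pos))
    # Several annotation records can share one position; check each of them.
    while i < len(KEYS) and KEYS[i] == (chrom, pos):
        if ANNOTATIONS[i][2:4] == (ref, alt):
            return ANNOTATIONS[i][4]
        i += 1
    return None


def annotate(records):
    """Stream input records one by one and attach the looked-up ID."""
    for chrom, pos, ref, alt in records:
        yield chrom, pos, ref, alt, lookup(chrom, pos, ref, alt)


if __name__ == "__main__":
    # The input is a plain iterator, as it would be when read from a pipe.
    stream = iter([("chr1", 250, "G", "T"), ("chr2", 41, "C", "A")])
    for rec in annotate(stream):
        print(*rec, sep="\t")
```

Only the annotation side needs random access here, so the input can come from a stream. The cost is one lookup per input record, which pays off when the input is small relative to the annotation file and loses to a single synced pass when it is not.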
If you are willing to consider this feature, I would be happy to look into implementing it myself at some point in the next few weeks and open a pull request.
I am open to this feature, but must warn that it will need some careful thought. The annotation file can be a tab-delimited file or a VCF/BCF, which means there are several internal branches. I started drafting a long comment on how you would go about it, but realized that it would be best to extend bcf_synced_reader in htslib as well; otherwise one would end up re-implementing some of the existing functionality in bcftools.
If you want to do some exploration first, I'll be happy to give some pointers later.