Add argument to iterate only through the input file for bcftools annotate

Open Balthasar-eu opened this issue 1 year ago • 1 comment

I suggest adding an argument to bcftools annotate that iterates only through the input file and looks up each variant in the annotation file. This would be advantageous when the input file is much smaller than the annotation file and most variants are expected to be annotated. A related performance issue is outlined in #1199. As a side effect, it should also allow using bcftools annotate without indexing the input file first, which would make it possible to pipe in output from other bcftools commands and allow further speedups in some workflows.

On the code side, this could be implemented with an option like --iterate-only-input. When it is set, the standard BCF reader can be used instead of the synced bcf_sr_next_line, and each variant read from the input can be looked up in the (indexed) annotation file. The downside is a potential performance regression if the annotation file is smaller than the input VCF, but since the user has to select this option explicitly, that should not pose any problems.
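To make the proposed control flow concrete, here is a minimal sketch in Python rather than htslib C. It is only a model of the idea: the annotation index is stood in for by an in-memory dict, the record layout and the function names `build_annotation_index` and `annotate_input_only` are made up for illustration, and none of this reflects how bcftools is actually structured.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record layout: a dict with chrom/pos/ref/alt plus annotation fields.
Key = Tuple[str, int, str, str]
Record = Dict[str, object]


def variant_key(rec: Record) -> Key:
    """Identify a variant by position and alleles."""
    return (str(rec["chrom"]), int(rec["pos"]), str(rec["ref"]), str(rec["alt"]))


def build_annotation_index(annotations: Iterable[Record]) -> Dict[Key, Record]:
    """Stand-in for the on-disk index of the annotation file (assumption)."""
    return {variant_key(a): a for a in annotations}


def annotate_input_only(
    records: Iterable[Record],
    index: Dict[Key, Record],
    field: str = "AF",
) -> Iterator[Record]:
    """Walk the input once and look each variant up in the annotation index.

    The input is consumed as a plain stream, so it needs no index of its own
    and could just as well come from a pipe.
    """
    for rec in records:
        out = dict(rec)  # copy so the caller's record is left untouched
        hit = index.get(variant_key(rec))
        if hit is not None and field in hit:
            out[field] = hit[field]
        yield out  # unannotated records pass through unchanged
```

The cost here scales with the number of input records times one lookup each, which is why it would help for a small input against a large annotation file and hurt in the opposite case.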

If you are willing to consider this feature, I would be happy to take a look at implementing it myself at some point in the coming weeks and to open a pull request.

Balthasar-eu, Jun 02 '23 01:06

I am open to this feature, but must warn that it will need some careful thought. The annotation file can be a tab-delimited file or a VCF/BCF, which means there are several internal branches. I started drafting a long comment on how you would go about it, but realized that it would be best to extend bcf_synced_reader in htslib as well; otherwise one would end up re-implementing some of the existing functionality in bcftools.

If you want to do some exploration first, I'll be happy to give some pointers later.

pd3, Jun 05 '23 08:06