hts_idx_t file query API
htslib appears to lack a clean index querying API. The internals of `hts_idx_t` are not publicly visible, and there does not appear to be a function to find the appropriate position to seek to.
Looking at the source code, the following options appear to be available:
- Create an `hts_itr_t` using `hts_itr_query()` and then peek at `hts_itr_t.curr_off`. Unfortunately, this structure should be considered opaque by end users.
- Wrap a call to `hts_itr_multi_next()` with no-op callbacks, except for an `hts_seek_func` that saves the seek position so it can then be returned. The mock `htsFile` cannot be NULL, nor a CRAM file, due to the internal `cram_set_option` calls within `hts_itr_multi_next`.
Other approaches, such as handing ownership of your file pointer over to BGZF, calling `hts_itr_next()` with a no-op `hts_readrec_func`, letting the seek happen, and then querying `fp->seeked`, might work, but I want to handle the compression myself.
What is the 'correct' way of querying an index for the file position(s)/region(s) in which a genomic interval/position could be located?
Is this another edition of #385 and #845?
Possibly what you need is the hts_itr_next_chunk() function proposed in this exp/raw-index branch. Though I expect it needs updating to deal with multi-iterators.
Yes, the essence of the request is the same as those.
It does really need a big shake up and overhaul, but it's a non-trivial task. As John says, I think the multi-region iterator may be a better place to start. For [BV]CF there's also the synced reader code, which is mainly about on-the-fly merging, but I think it's also sometimes used for iterator queries on a single file.
I've been playing with indexes a bit recently and seriously, how did we ever get in such a mess!? They're all over the shop. Even the apparently same index format can come in multiple incompatible variants. Eg CSI on BCF holds information on all chromosomes listed in the header irrespective of coverage, while CSI on VCF holds information only on covered records even if there is full information in the header. So even though it's the same basic indexing structure, we have two parallel sets of function calls for building, querying and iterating, as the chr-name to "tid" conversion function differs and it bubbles all the way up the chain. This all appears to be an invention serving no real purpose other than to frustrate the user. (I suspect it all stems from the decision to make contig lines optional, but optional doesn't mean you shouldn't use them when they're present, so IMO the distinction was a big mistake.)
Add to this the fact that CRAM is a totally different beast with a completely different indexing strategy, and the fact that most of the world uses tabix for indexing VCF but not for indexing SAM. (Although I suspect I'm the only one who's ever tried to do random access on SAM instead of BAM/CRAM. I added it to work around the long chromosome issues.)
What I think we need are two levels of indexing too.
- Mapping genomic coordinates to uncompressed file coordinates: bai, csi, tbi, crai
- Mapping uncompressed file coordinates to compressed file coordinates (compressed block start offset & uncompressed offset into block): gzi, plus those above.
These serve different purposes, but both are needed. The lack of the second is why PacBio invented yet another index format for parallel processing of their unaligned BAMs. This isn't just an index formatting thing, but also an API issue.
I've also suffered issues with indexing being tied to a file handle (cram). There are times I want multiple threads working on separate parts of the file. Due to the way hfile works with the buffer, this does need opening the file multiple times, which in turn needs reading the index multiple times. Ideally we'd be able to separate those apart somehow. A shared file descriptor with per-thread read-ahead buffers and a shared index could work, although reopening the file descriptor may still win out, as the OS then also does read-ahead optimisation and removal of seeks. Eg see https://github.com/samtools/samtools/blob/develop/bam_consensus.c#L2598-L2614 for an example of the current strategy when implementing parallel consensus calling, where each thread in the thread pool gets its own open file handle and (sometimes) index to play with.
Finally one other long-standing "to do" item on my list is fixing the region query to be able to do "chr:X-Y" vs "chr:>X-Y" or similar, with the latter being items with start coordinate >X and <= Y. The former is for overlaps, which means a series of successive coordinates will duplicate items, but the latter explicitly removes duplicates. It can be done already by adding a filter expression, but this is slow. The trivial query change would make it much simpler to implement parallel processing.
Can you please explain what the problem is that you're trying to solve here? It's likely that an updated API such as John mentioned is what's needed, but I'm not sure. (Ie "what's the problem", rather than "this is the solution I'd like".)
Note this may also be extremely specific to BGZF (virtual offsets vs absolute offsets), and possibly even then to the file format being compressed, so the details do matter.