htslib

bgzf_useek and bgzidx_t scope

Open Poshi opened this issue 3 years ago • 0 comments

I've been using the bgzf functions to process some BGZ files and I found some issues with them that could be addressed without too much effort.

My need was to slice the file into equal-sized chunks, so I had to open the file, get the uncompressed file size, divide it, and extract each chunk.
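To make the slicing step concrete, here is a minimal sketch of the division, assuming the uncompressed size is already known. `chunk_range` and `chunk_bounds` are hypothetical names for illustration, not htslib API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type (not part of htslib): the half-open
 * uncompressed byte range [start, end) covered by one chunk. */
typedef struct {
    uint64_t start;
    uint64_t end;
} chunk_range;

/* Split `total` uncompressed bytes into `n_chunks` near-equal pieces and
 * return the range of chunk `i` (0-based). When `total` is not a multiple
 * of `n_chunks`, the first `total % n_chunks` chunks get one extra byte. */
chunk_range chunk_bounds(uint64_t total, uint64_t n_chunks, uint64_t i)
{
    assert(n_chunks > 0 && i < n_chunks);

    uint64_t base  = total / n_chunks;
    uint64_t extra = total % n_chunks;

    chunk_range r;
    r.start = i * base + (i < extra ? i : extra);
    r.end   = r.start + base + (i < extra ? 1 : 0);
    return r;
}
```

Each `start` could then be passed to `bgzf_useek(fp, start, SEEK_SET)`, followed by reading `end - start` bytes.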

First, documentation. The code is the documentation: I couldn't find a proper place where the different modules are explained. Second, bgzf_useek only accepts SEEK_SET. SEEK_CUR should be trivial to implement, and SEEK_END should not be hard either.

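For SEEK_CUR you only need a SEEK_SET to the current uncompressed position plus the offset. The current position should come from bgzf_utell(), since bgzf_tell() returns a virtual offset rather than an uncompressed one.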
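As a rough sketch of that SEEK_CUR translation: the helper below is hypothetical (it is not part of htslib) and assumes `current` is the uncompressed position, such as `bgzf_utell()` reports.

```c
#include <stdint.h>

/* Hypothetical helper, not part of htslib: translate a SEEK_CUR request
 * into the absolute uncompressed offset to use with SEEK_SET.
 * `current` is assumed to be the current uncompressed position.
 * Returns 0 and stores the target on success; returns -1 if the target
 * would lie before the start of the data or overflow int64_t. */
int seek_cur_target(int64_t current, int64_t offset, int64_t *target)
{
    if (current < 0)
        return -1;
    if (offset > 0 && current > INT64_MAX - offset)
        return -1;                  /* current + offset would overflow */
    if (offset < 0 && current + offset < 0)
        return -1;                  /* would seek before the start */

    *target = current + offset;
    return 0;
}
```

With the target in hand, the seek itself would be `bgzf_useek(fp, target, SEEK_SET)`, which still requires an index to be loaded.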

For SEEK_END you need to know the length of the uncompressed file. That's trickier. You need to access the index, go to the last entry, position yourself there and decompress from that block to the end of the file. Count how many bytes have been decompressed, add that to the uncompressed offset stored in the last index entry and you have the total length. From that, SEEK_END is a SEEK_SET to that length plus the offset (which is normally zero or negative, as with lseek()).
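The arithmetic behind the SEEK_END case can be sketched as follows. `idx_entry` is a hypothetical stand-in for one index entry (the real bgzidx1_t is private to the implementation file, so its layout is assumed here), and both functions are illustrative rather than htslib API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one index entry (assumed layout): the
 * compressed address of a block and the uncompressed offset at which
 * that block starts. */
typedef struct {
    uint64_t caddr;
    uint64_t uaddr;
} idx_entry;

/* Total uncompressed length = uncompressed offset of the last indexed
 * block + number of bytes decompressed from that block to end of file.
 * With an empty index, `tail_bytes` is counted from the start.
 * Returns 0 on success, -1 on overflow. */
int uncompressed_length(const idx_entry *entries, size_t n,
                        uint64_t tail_bytes, uint64_t *length)
{
    uint64_t base = n ? entries[n - 1].uaddr : 0;
    if (base > UINT64_MAX - tail_bytes)
        return -1;
    *length = base + tail_bytes;
    return 0;
}

/* SEEK_END with lseek()-style semantics: target = length + offset.
 * Positive offsets are rejected here because there is nothing to read
 * past the end. Returns 0 on success, -1 if the target is out of range. */
int seek_end_target(uint64_t length, int64_t offset, uint64_t *target)
{
    if (offset > 0)
        return -1;
    uint64_t back = 0 - (uint64_t)offset;   /* magnitude of the offset */
    if (back > length)
        return -1;
    *target = length - back;
    return 0;
}
```

Counting `tail_bytes` means reading from the last indexed block until end of file, so it makes sense to cache the resulting length once it has been computed.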

If someone wants to implement all of this from the outside, they need access to the bgzidx_t and bgzidx1_t structs, which are defined in the implementation file. These structs should be moved to the header file, or a set of functions to manage the index should be provided.

With all of this in place, working with BGZ files should be considerably easier.

Other ideas: automatically loading an index when opening the file, if one is present, or even automatically generating an in-memory index when one is required (random access is attempted) and none has been found or loaded.

Poshi, Feb 19 '21 12:02