noodles icon indicating copy to clipboard operation
noodles copied to clipboard

Implement region query for the reader of bgzipped indexed fasta

Open lu6huimin opened this issue 1 year ago • 3 comments

See #108, the key is to impl Seek for bgzf::Reader. My draft implementation has three steps:

  1. Rename all fn seek(&mut self, pos: VirtualPosition) to fn seek_virtual_position(&mut self, pos: VirtualPosition).
  2. impl Seek for bgzf::Reader.
  • The reader is only seekable when it has the gzi to refer. So I added two fields gzi: Option<gzi::Index> and uncompressed_position: Option<u64>, to store the index and uncompressed offset, respectively. Plus a help function bgzf::Reader::with_gzi to initialize a seekable reader.

  • The uncompressed_position is relative to the start of the uncompressed file, so it should move forward after calling fn consume. This field is different from the upos in VirtualPosition which is relative to the uncompressed data in the block. I implemented conversion from uncompressed_position to VirtualPosition by referring the answer, to make use of the fn seek_virtual_position in fn seek.

  1. Add an example of querying bgzipped indexed fasta file.

Feel free to modify the added code, or discard the changes.

BTW, thanks for building this fantastic crate.

lu6huimin avatar Aug 23 '22 08:08 lu6huimin

Sorry for the long response time, and thanks for your patience.

I don't think this is the best approach to this problem. Readers in noodles are largely agnostic to indices, and this would couple the GZ index to the BGZF reader. However, I'm still experimenting with possible solutions.

zaeleus avatar Sep 02 '22 21:09 zaeleus

It doesn't matter, thank you for investigating on it. It's my pleasure to give any help.

I agree with you, directly implementing Seek for BGZF reader is not good. It's useless to seek uncompressed position for BAM which also uses BGZF reader.

Maybe re-implementing Seek for FASTA reader is better.

ghost avatar Sep 03 '22 02:09 ghost

Sorry, opposite to my last reply, I find implementing Seek for uncompressed position maybe OK, although it's useless for most readers in noodles.

GZI (https://www.htslib.org/doc/bgzip.html#GZI_FORMAT) is agnostic to the uncompressed format (FASTA or anything else; In fact, I find nothing else using GZI right now). It's designed for seeking uncompressed position, which looks like an API for BGZF indexing.

Out of curiosity, I searched in htslib source code, finding that it also provides bgzf_useek [commit info] [bgzf.h] [bgzf.c].

ghost avatar Sep 03 '22 03:09 ghost

Thanks for your interest and possible solution. A different approach was implemented by delegating to a seekable raw reader. See {bgzf,fasta}::IndexedReader.

zaeleus avatar Nov 03 '22 16:11 zaeleus

Thanks for showing a nicer solution! I suggest a minor improvement (#129) on searching method.

ghost avatar Nov 04 '22 15:11 ghost