
bgzip support

Open slowkow opened this issue 9 years ago • 6 comments

Would it be possible to support files compressed with bgzip? Here's the link to source code. This would be very valuable for bioinformaticians.

Right now, here's what I get:

zindex test3.gz -v --regex '\trs([0-9]+)' --skip-first 5 --numeric --unique

Opening database test3.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 MiB
Indexing...
Progress: 18 bytes of 129.16 MiB (0.00%)
Index reading complete
Flushing
Done
Closing database

It works after I convert from bgzip to gzip:

zcat test3.gz | gzip > test4.gz
zindex test4.gz -v --regex '\trs([0-9]+)' --skip-first 5 --numeric --unique

Warning: Rebuilding existing index test4.gz.zindex
Opening database test4.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 MiB
Indexing...
Progress: 10 bytes of 123.81 MiB (0.00%)
Progress: 85.41 MiB of 123.81 MiB (68.98%)
Index reading complete
Flushing
Done
Closing database

slowkow avatar Dec 03 '15 20:12 slowkow

I'd happily accept a patch to support this file format, but without clear documentation on what the file format is, plus a good way to "fast forward" and store partial decompression information, it may be very difficult.

mattgodbolt avatar Mar 31 '16 22:03 mattgodbolt

I'd value support for this as well; the BGZF file format is gunzip compatible and the specs are here. The tabix index is published here.

schelhorn avatar Dec 15 '16 12:12 schelhorn

Thanks for the +1. I'll see what I can do. Time for zindex/zq is seriously limited at the moment.

mattgodbolt avatar Dec 15 '16 13:12 mattgodbolt

+1 for bgzip.

lonphan avatar May 31 '17 17:05 lonphan

Just trying to understand this a bit more. It seems like:

  • BGZF is really a sequence of compressed gzip blocks, each with extra information. The blocks are concatenated, which means the compression state is not required at each block boundary (zindex was specifically written to avoid having to do this on the source file).
  • tabix is an indexing system that understands the BGZF file format and is able to index it and then offer random access to the blocks of the file.
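To make the first point concrete, here is a minimal Python sketch (not zindex code). It assumes plain concatenated gzip members rather than real bgzip output, and the helper names `make_concatenated_gzip` and `split_members` are made up for illustration. It shows that each member can be decompressed on its own, starting from its byte offset, with no state carried over from earlier members:

```python
import gzip
import zlib


def make_concatenated_gzip(chunks):
    """Compress each chunk as its own gzip member and concatenate the results."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


def split_members(data):
    """Decompress each gzip member independently.

    Returns a list of (compressed_offset, payload) pairs, one per member.
    """
    members = []
    offset = 0
    while offset < len(data):
        # wbits=31 tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(wbits=31)
        payload = d.decompress(data[offset:]) + d.flush()
        members.append((offset, payload))
        # Whatever follows the end of this member is left in unused_data.
        offset = len(data) - len(d.unused_data)
    return members


chunks = [b"line one\n", b"line two\n", b"line three\n"]
data = make_concatenated_gzip(chunks)

# A multi-member-aware reader sees one continuous stream...
whole = gzip.decompress(data)

# ...while each member also decodes on its own from its offset.
members = split_members(data)
```

A reader that stops after the first member would see only the first chunk, so multi-member handling has to be explicit in the indexer.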

I'm not quite sure how zindex would fit into this? Perhaps someone here can share an example file and use case of queries?

At the very least, zindex should support concatenated gzip files (which are spec-compliant), even if it doesn't use the tabix format in any way. There might then be an option to drop the need for the compression buffers in the zindex indices, which would make them smaller.
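As a rough illustration of why the compression buffers could be dropped for such files, here is a hedged Python sketch. It is not the zindex index format: `build_member_index` and `read_at` are hypothetical names, and the "index" is just a list of offsets. Because every member starts with fresh compression state, one pair of offsets per member is enough to seek:

```python
import bisect
import gzip
import zlib


def build_member_index(data):
    """Return [(uncompressed_start, compressed_offset), ...], one entry per member."""
    index = []
    comp_off = 0
    uncomp_off = 0
    while comp_off < len(data):
        d = zlib.decompressobj(wbits=31)  # 31 = expect gzip framing
        payload = d.decompress(data[comp_off:]) + d.flush()
        index.append((uncomp_off, comp_off))
        uncomp_off += len(payload)
        comp_off = len(data) - len(d.unused_data)
    return index


def read_at(data, index, pos, length):
    """Read `length` uncompressed bytes starting at uncompressed offset `pos` (pos >= 0)."""
    starts = [u for u, _ in index]
    # Find the last member whose uncompressed start is <= pos.
    i = bisect.bisect_right(starts, pos) - 1
    skip = pos - index[i][0]
    out = b""
    while i < len(index) and len(out) < length:
        d = zlib.decompressobj(wbits=31)
        payload = d.decompress(data[index[i][1]:]) + d.flush()
        out += payload[skip:]
        skip = 0
        i += 1
    return out[:length]


data = b"".join(gzip.compress(c) for c in [b"aaaa", b"bbbb", b"cccc"])
index = build_member_index(data)
```

With an ordinary single-member gzip file this shortcut is not available, since decoding from the middle needs the preceding window of uncompressed data; that is the state the checkpoints have to store.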

mattgodbolt avatar Jun 05 '17 23:06 mattgodbolt

OK: I now support what I believe is the bgzip format, though without understanding any of its tables etc. As bgzip is just concatenated gzip files (with extra trailer info), it should "just work". @slowkow and/or @schelhorn, can you give it a go please? Again, this doesn't use or understand the tabix part.

mattgodbolt avatar Jun 09 '17 21:06 mattgodbolt