zindex
bgzip support
Would it be possible to support files compressed with bgzip? Here's the link to source code. This would be very valuable for bioinformaticians.
Right now, here's what I get:
zindex test3.gz -v --regex '\trs([0-9]+)' --skip-first 5 --numeric --unique
Opening database test3.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 MiB
Indexing...
Progress: 18 bytes of 129.16 MiB (0.00%)
Index reading complete
Flushing
Done
Closing database
It works after I convert from bgzip to gzip:
zcat test3.gz | gzip > test4.gz
zindex test4.gz -v --regex '\trs([0-9]+)' --skip-first 5 --numeric --unique
Warning: Rebuilding existing index test4.gz.zindex
Opening database test4.gz.zindex in read-write mode
Building index, generating a checkpoint every 32.00 MiB
Indexing...
Progress: 10 bytes of 123.81 MiB (0.00%)
Progress: 85.41 MiB of 123.81 MiB (68.98%)
Index reading complete
Flushing
Done
Closing database
I'd happily accept a patch to support this file format, but without clear documentation on what the file format is, plus a good way to "fast forward" and store partial decompression information, it may be very difficult.
I'd value support for this as well; the BGZF file format is gunzip compatible and the specs are here. The tabix index is published here.
Thanks for the +1. I'll see what I can do. Time for zindex/zq is seriously limited at the moment.
+1 for bgzip.
Just trying to understand this a bit more. It seems like:
- BGZF is really a sequence of compressed gzip blocks, each with extra information. The blocks are concatenated, which means the compression state is not required at each block boundary (zindex was specifically written to avoid having to do this on the source file).
- tabix is an indexing system that understands the BGZF file format and is able to index it and then offer random access to the blocks of the file.
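To make the block structure concrete, here is a minimal Python sketch (not zindex code) that walks a BGZF-style stream one block at a time. It relies on the layout described in the BGZF section of the SAM specification: each gzip member carries a "BC" extra subfield whose 16-bit BSIZE value is the total block size minus one. The helper names make_bgzf_block and iter_bgzf_blocks are hypothetical, and the sample stream is built in memory rather than taken from a real bgzip file.

```python
import struct
import zlib


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF-style gzip member around payload (hypothetical helper)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no wrapper
    deflated = comp.compress(payload) + comp.flush()
    # 12-byte gzip header + 6-byte extra subfield + data + 8-byte trailer, minus 1
    bsize = 12 + 6 + len(deflated) + 8 - 1
    header = struct.pack(
        "<4BI2BH",
        0x1F, 0x8B, 8, 4,  # magic bytes, CM=deflate, FLG=FEXTRA
        0,                 # MTIME
        0, 255,            # XFL, OS=unknown
        6,                 # XLEN: length of the extra field
    )
    extra = struct.pack("<2BHH", ord("B"), ord("C"), 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + deflated + trailer


def iter_bgzf_blocks(data: bytes):
    """Yield (offset, decompressed_bytes) per block, using only BSIZE to seek."""
    offset = 0
    while offset < len(data):
        id1, id2, cm, flg = struct.unpack_from("<4B", data, offset)
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 4:
            raise ValueError("not a BGZF block at offset %d" % offset)
        (xlen,) = struct.unpack_from("<H", data, offset + 10)
        # Scan the extra field for the 'BC' subfield that carries BSIZE
        pos, end, bsize = offset + 12, offset + 12 + xlen, None
        while pos < end:
            si1, si2, slen = struct.unpack_from("<2BH", data, pos)
            if (si1, si2) == (66, 67) and slen == 2:
                (bsize,) = struct.unpack_from("<H", data, pos + 4)
            pos += 4 + slen
        if bsize is None:
            raise ValueError("no BC subfield at offset %d" % offset)
        block = data[offset:offset + bsize + 1]
        # Each block is a complete gzip member, so no earlier state is needed;
        # wbits=31 tells zlib to expect a gzip header and trailer
        yield offset, zlib.decompress(block, 31)
        offset += bsize + 1


stream = b"".join(make_bgzf_block(p) for p in (b"first block\n", b"second block\n"))
blocks = list(iter_bgzf_blocks(stream))
```

Because every block starts from a fresh compression state, a (file offset, line offset) pair would in principle be enough to resume reading at any block boundary.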
I'm not quite sure how zindex would fit into this? Perhaps someone here can share an example file and use case of queries?
At the very least zindex should support the concatenated gzip files (which is spec compliant), even if it doesn't use the tabix format in any way. There might then be an option to drop the need for the compression buffers in the zindex indices, which will make them smaller.
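As an illustration of what supporting concatenated gzip files involves, the following Python sketch contrasts a decoder that stops at the end of the first gzip member with one that restarts on the leftover bytes. It is only a sketch using Python's zlib module, not how zindex itself is written; the function names are made up for this example. A decoder of the first kind reads very little of a multi-member file, which would be consistent with the short indexing run reported above, though that cause is not confirmed here.

```python
import gzip
import zlib


def decompress_first_member(data: bytes) -> bytes:
    """Decode only the first gzip member; later members are ignored."""
    d = zlib.decompressobj(wbits=31)  # 31 = gzip header/trailer expected
    return d.decompress(data)


def decompress_all_members(data: bytes) -> bytes:
    """Decode every gzip member by restarting on the unused tail."""
    out = []
    while data:
        d = zlib.decompressobj(wbits=31)
        out.append(d.decompress(data))
        if not d.eof:
            raise ValueError("truncated gzip member")
        # Bytes after the end of this member belong to the next one
        data = d.unused_data
    return b"".join(out)


# Two independently compressed members, concatenated as bgzip would do
concatenated = gzip.compress(b"line one\n") + gzip.compress(b"line two\n")
first_only = decompress_first_member(concatenated)
everything = decompress_all_members(concatenated)
```

Here first_only holds just the first line while everything holds both, which is the behaviour gunzip and zcat give for such files.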
Ok: I now support what I believe is the bgzip format, though without understanding any of its tables etc. As bgzip is just concatenated gzip files (with extra trailer info) it should "just work". @slowkow and/or @schelhorn can you give it a go please? Again, this doesn't use or understand the tabix part.