
Feature request: zsttool

Open fiddyschmitt opened this issue 3 years ago • 6 comments

Hi Roberto,

Going out on a limb here, but do you think you can make a tool to index Zstandard files?

fiddyschmitt avatar Jan 04 '22 08:01 fiddyschmitt

A quick review suggests that the Zstandard format is also "indexable". Whether I can find the time to implement this is another matter 🙂 I would probably make a quick implementation in a scripting language first, to test the possibilities...

circulosmeos avatar Jan 15 '22 21:01 circulosmeos

Thanks for looking into it! Understand about finding time :)

fiddyschmitt avatar Jan 16 '22 00:01 fiddyschmitt

@fiddyschmitt Maybe t2sz might be something for you. It compresses to zstd in such a manner that it can be easily seeked, e.g., with ratarmount, indexed_zstd, and libzstd-seek.

mxmlnkn avatar Jul 18 '22 20:07 mxmlnkn

Awesome, thanks @mxmlnkn. That's really interesting. Do you know if t2sz can be used to create an index for an existing zst file (without having to create a new zst file)?

fiddyschmitt avatar Jul 19 '22 05:07 fiddyschmitt

Unfortunately not.

I'm pretty sure that, the last time I looked at the file formats, I found that it would be near impossible to do. Similar to gzip, a zstd file is a sequence of streams (called frames in zstd), each made up of blocks. This is, by the way, also true for xz and lz4, I think. And while blocks are somewhat seekable, they require a back-reference window, i.e., the last x bytes of previously decoded data. In contrast, streams are completely independent. This is why t2sz creates multiple streams instead of the single stream per file that the standard zstd compressor creates by default. But while the back-reference window in gzip is limited to 32 KiB, it can be as large as 2 GiB for zstd, and similarly large for xz, if I remember correctly (lz4 is the exception, with a window of at most 64 KiB). This makes indexing near-impossible because you would have to save up to 2 GiB per checkpoint.
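To illustrate why independent streams make seeking easy, here is a minimal sketch. It uses gzip members from Python's standard `zlib` module as a stand-in for zstd frames, since both formats allow concatenating independent streams. It is not how t2sz or any of the tools mentioned above are implemented, and the function names `compress_in_members` and `read_at` are made up for this example.

```python
import bisect
import zlib


def compress_in_members(data: bytes, chunk_size: int):
    """Compress non-empty data as independent gzip members.

    Returns (blob, index), where index is a list of
    (uncompressed_offset, compressed_offset) pairs, one per member.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), chunk_size):
        index.append((start, len(blob)))
        # A fresh compressor per chunk: no back-reference can cross a
        # member boundary. wbits=31 selects the gzip container.
        comp = zlib.compressobj(wbits=31)
        blob += comp.compress(data[start:start + chunk_size]) + comp.flush()
    return bytes(blob), index


def read_at(blob: bytes, index, offset: int, length: int) -> bytes:
    """Read `length` bytes at uncompressed `offset`, decoding only the
    members that overlap the requested range."""
    starts = [u for u, _ in index]
    i = bisect.bisect_right(starts, offset) - 1  # member containing offset
    skip = offset - index[i][0]
    out = bytearray()
    while i < len(index) and len(out) < skip + length:
        comp_start = index[i][1]
        comp_end = index[i + 1][1] if i + 1 < len(index) else len(blob)
        # Each member decodes on its own, without any earlier data.
        out += zlib.decompress(blob[comp_start:comp_end], wbits=31)
        i += 1
    return bytes(out[skip:skip + length])
```

Each checkpoint here costs only two integers, because a stream boundary needs no saved window. That is the difference from indexing inside a single stream, where every checkpoint has to carry the window contents.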

Maybe an index implementation could check how large the actually required back-reference windows are. If they are quite small, an index could still be created. I doubt that there are many zstd compression levels for which this is possible, but that is only speculation. One mitigating factor, similar to gzip, could be uncompressed blocks inside the archive. If they are large enough, a checkpoint could be created there, as the uncompressed chunks would serve as the back-reference window for all compressed blocks thereafter.
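As a rough first check, the window size a zstd frame declares can be read from its frame header without decompressing anything. The sketch below follows the frame header layout in the Zstandard format specification (magic number, Frame_Header_Descriptor, optional Window_Descriptor, optional Dictionary_ID, optional Frame_Content_Size). The function name `declared_window_size` is made up for this example, skippable frames are not handled, and the result is only an upper bound: the back-references actually used may reach less far, which could only be found out by decoding.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # 0xFD2FB528 stored little-endian


def declared_window_size(header: bytes) -> int:
    """Return the window size in bytes declared by a zstd frame header.

    `header` must start at the beginning of a regular zstd frame.
    """
    if header[:4] != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")
    descriptor = header[4]
    fcs_flag = descriptor >> 6            # Frame_Content_Size_flag
    single_segment = (descriptor >> 5) & 1
    dict_id_flag = descriptor & 3
    pos = 5
    if not single_segment:
        # Window_Descriptor: 5-bit exponent, 3-bit mantissa.
        window_descriptor = header[pos]
        exponent = window_descriptor >> 3
        mantissa = window_descriptor & 7
        base = 1 << (10 + exponent)
        return base + (base // 8) * mantissa
    # Single-segment frame: no Window_Descriptor, the whole content is
    # the window, so the window size equals Frame_Content_Size.
    pos += (0, 1, 2, 4)[dict_id_flag]     # skip Dictionary_ID, if any
    fcs_bytes = (1, 2, 4, 8)[fcs_flag]    # flag 0 means 1 byte here
    value = int.from_bytes(header[pos:pos + fcs_bytes], "little")
    if fcs_bytes == 2:
        value += 256                      # 2-byte form is offset by 256
    return value
```

A hypothetical indexer could compare this value against a budget per checkpoint (for instance gzip's 32 KiB) and decline to index frames whose declared window is larger.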

mxmlnkn avatar Jul 19 '22 07:07 mxmlnkn

Fascinating, thanks!

fiddyschmitt avatar Jul 19 '22 09:07 fiddyschmitt