
add derive(Serialize, Deserialize) to DecompressorOxide

Open dishmaker opened this issue 1 year ago • 1 comments

Fixes

  • https://github.com/Frommi/miniz_oxide/issues/165
  • https://github.com/bearcove/rc-zip/issues/99

I know this PR will not be merged, but it proves that state saving is actually possible!

The entire DecompressorOxide state compresses down to only 356 bytes for a simple test case.

partial read len=32768
save at in_pos=11729
[miniz_oxide/tests/test_serde.rs:57:5] decompressor_state_msgpack.len() = 11824
[miniz_oxide/tests/test_serde.rs:58:5] decompressor_state_compressed.len() = 356
resume at in_pos=11729

So fast-seekable zip files are possible 😄
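For illustration, the save/restore pattern those numbers come from can be sketched with a toy state in place of DecompressorOxide, and a hand-rolled byte encoding in place of rmp_serde so the sketch stays dependency-free (all names here are hypothetical, not the PR's actual API):

```rust
// Toy stand-in for DecompressorOxide state, illustrating the
// save-at-in_pos / resume-at-in_pos pattern from the log above.
// The real PR uses serde + rmp_serde instead of this hand-rolled encoding.
#[derive(Debug, Clone, PartialEq)]
struct ToyState {
    in_pos: u64,
    window: Vec<u8>,
}

impl ToyState {
    /// Serialize the state: 8-byte little-endian position, then the window.
    fn save(&self) -> Vec<u8> {
        let mut out = self.in_pos.to_le_bytes().to_vec();
        out.extend_from_slice(&self.window);
        out
    }

    /// Reconstruct the state from a snapshot produced by `save`.
    fn restore(bytes: &[u8]) -> ToyState {
        let mut pos = [0u8; 8];
        pos.copy_from_slice(&bytes[..8]);
        ToyState {
            in_pos: u64::from_le_bytes(pos),
            window: bytes[8..].to_vec(),
        }
    }
}

fn main() {
    let state = ToyState { in_pos: 11_729, window: vec![0xAB; 32] };
    let snapshot = state.save(); // persist this alongside the archive
    let resumed = ToyState::restore(&snapshot);
    assert_eq!(resumed, state); // resume decompression at in_pos=11729
    println!("resumed at in_pos={}", resumed.in_pos);
}
```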

dishmaker avatar Feb 17 '25 14:02 dishmaker

Could you leave the black_box change out of this PR? I would like to evaluate that separately.

I think adding a serde feature could be fine if there is a use for it, though I would prefer that it's not enabled by default. Is it also possible to only enable the rmp_serde dev dependency conditionally?

oyvindln avatar Feb 18 '25 20:02 oyvindln

Thanks for fixing the test part and splitting up the changes - will look over this and try to get it sorted when I have some time

oyvindln avatar Feb 25 '25 14:02 oyvindln

Sorry for my ignorance. There are better ways to achieve random access in zlib files. But not in zstd files.

dishmaker avatar Feb 28 '25 20:02 dishmaker

Ah okay - just took me some days to get around to this - would have been fine with adding serde support if it was useful otherwise.

oyvindln avatar Mar 05 '25 18:03 oyvindln

I'm currently working on a similar project to enable random-access seekable ZIP files, and for now I'm resorting to maintaining a serde remote copy of the DecompressorOxide type. Would love to see this added and gated behind a feature flag to enable snapshot/restore of state.

grant-h avatar Mar 07 '25 16:03 grant-h

Coincidentally I was looking for random access to .gz files a few days ago, and this makes that possible - thanks!

(My use case is trying to quickly extract arbitrary files from an 80GB collection of downloaded .tar.gz archives, without having to reencode them in a better file format.)

But after experimenting with it, I think this is a slightly inefficient and (more importantly) fragile way to implement random access. For comparison, zlib's zran.c creates access points at the start of a deflate block. The Huffman trees etc. are independent for each block, so they don't need to be serialized. You only need to store the 32KB window and the current input position (in bits), and that lets you restart decompression at that block with a DecompressorOxide::new().
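An access point in the zran.c scheme is just plain data, independent of any decompressor internals. A minimal sketch, with hypothetical names:

```rust
// One zran-style access point: enough to restart inflation at a
// deflate block boundary with a fresh decompressor.
#[derive(Debug, Clone, PartialEq)]
struct AccessPoint {
    /// Position of the block start in the compressed stream, in bits.
    bit_pos: u64,
    /// Up to 32 KiB of uncompressed history preceding this block,
    /// used to seed the LZ77 window on restart.
    window: Vec<u8>,
}

impl AccessPoint {
    /// Byte offset to seek to in the compressed file.
    fn byte_offset(&self) -> u64 {
        self.bit_pos / 8
    }

    /// Number of already-consumed bits within that byte (0..=7).
    fn bit_offset(&self) -> u8 {
        (self.bit_pos % 8) as u8
    }
}

fn main() {
    // Hypothetical access point at compressed bit position 93,835.
    let ap = AccessPoint { bit_pos: 93_835, window: vec![0u8; 32 * 1024] };
    assert_eq!(ap.byte_offset(), 11_729); // seek here...
    assert_eq!(ap.bit_offset(), 3);       // ...then discard 3 bits
    println!("seek to byte {}, skip {} bits", ap.byte_offset(), ap.bit_offset());
}
```

Since the struct holds only the window and a bit position, it serializes to a stable format that outlives any particular decompressor implementation.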

Compared to binary-serializing the whole DecompressorOxide, that reduces the size of an access point from about 40KB to 32KB (uncompressed), and you can store it in a stable file format that doesn't depend on the implementation details of DecompressorOxide. That avoids concerns about versioning and compatibility if those implementation details ever change, and also means the same index file could be easily supported with another backend like libz-sys.

One downside is that you have less flexibility in exactly where the access points go, since it depends on block size. As far as I can tell, zlib compresses with pretty small blocks, but I don't know if other compressors might potentially emit giant blocks? But it's fine for the files I'm currently dealing with (the largest block is about 64KB compressed, and I want access points at least 1MB apart).
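One way to keep access points at least a minimum distance apart, given the uncompressed position of each block boundary, is a simple greedy filter. A sketch with hypothetical names, not miniz_oxide API:

```rust
/// Given (compressed_bit_pos, uncompressed_pos) pairs for every deflate
/// block boundary, keep only those at least `min_spacing` uncompressed
/// bytes apart. Names are illustrative, not miniz_oxide API.
fn select_access_points(boundaries: &[(u64, u64)], min_spacing: u64) -> Vec<(u64, u64)> {
    let mut kept = Vec::new();
    let mut last_kept: Option<u64> = None;
    for &(bit_pos, upos) in boundaries {
        match last_kept {
            // Too close to the previously kept point: skip this boundary.
            Some(prev) if upos - prev < min_spacing => continue,
            _ => {
                kept.push((bit_pos, upos));
                last_kept = Some(upos);
            }
        }
    }
    kept
}

fn main() {
    // Blocks every ~64 KiB uncompressed; ask for >= 1 MiB spacing.
    let boundaries: Vec<(u64, u64)> =
        (0u64..40).map(|i| (i * 500_000, i * 65_536)).collect();
    let points = select_access_points(&boundaries, 1 << 20);
    // The first boundary is always kept; the next kept one is >= 1 MiB later.
    assert_eq!(points[0].1, 0);
    assert!(points[1].1 >= 1 << 20);
}
```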

It also requires some API changes in miniz_oxide to expose the block boundaries and to resume from a non-byte-aligned offset, which means it diverges more from the original miniz. But arguably that's a smaller API change than this serde pull request, which effectively places every field of DecompressorOxide and State into the public API.

I've proposed #170 with the API changes, and I have a WIP seekable GzDecoder at https://github.com/zaynar/indexed_deflate

philiptaylor avatar Mar 17 '25 16:03 philiptaylor

The deflate format itself doesn't have any limit on block size (other than for uncompressed, i.e. stored, blocks, which are limited to 2^16 - 1 bytes since their length is stored in a 2-byte field at the start of the block). But as you have noted, compressors tend to limit how long the blocks they emit are, due to how they are implemented. Libraries that use more sophisticated algorithms for selecting when to start a new block, zopfli especially (libdeflate might also have some more complicated logic, but I'm not sure), may end up with longer blocks, so it wouldn't be completely foolproof to rely on.
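To make the stored-block limit concrete: after byte alignment, a BTYPE=00 block starts with LEN and NLEN (each 2 bytes, little-endian), where NLEN must be the one's complement of LEN, which is what caps stored blocks at 65,535 bytes. A minimal parser sketch:

```rust
/// Parse the LEN/NLEN pair at the start of a deflate stored (BTYPE=00)
/// block, after byte alignment. Returns the stored length, or None if
/// NLEN is not the one's complement of LEN (i.e. the header is corrupt).
fn stored_block_len(header: &[u8; 4]) -> Option<u16> {
    let len = u16::from_le_bytes([header[0], header[1]]);
    let nlen = u16::from_le_bytes([header[2], header[3]]);
    if nlen == !len { Some(len) } else { None }
}

fn main() {
    // A stored block of the maximum size, 65,535 bytes:
    assert_eq!(stored_block_len(&[0xFF, 0xFF, 0x00, 0x00]), Some(65_535));
    // A corrupted header (NLEN does not complement LEN):
    assert_eq!(stored_block_len(&[0x01, 0x00, 0x00, 0x00]), None);
}
```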

While it's not useful in your case, if you do have control over the compression, zlib (and miniz(_oxide)) also has flush parameters that let you start blocks at specific points, with or without aligning to a byte boundary, and with or without "forgetting" earlier data, so that decompression can be resumed at that point without the previous buffer.

oyvindln avatar Mar 17 '25 19:03 oyvindln

I've tried compressing a few varied files (text, binary, random) with gzip (-6, -9), miniz_oxide (6, 9), libdeflate (-1, -6, -9, -12), and zopfli (--i1). The largest blocks I saw were:

  • gzip: 66 KiB compressed; 3.5 MiB uncompressed
  • miniz_oxide: 57 KiB compressed; 2.8 MiB uncompressed
  • libdeflate: 299,737 B compressed; 300,197 B uncompressed
  • zopfli: 662,105 B compressed; 1,000,000 B uncompressed

libdeflate says #define SOFT_MAX_BLOCK_LENGTH 300000 (uncompressed bytes).

zopfli says #define ZOPFLI_MASTER_BLOCK_SIZE 1000000 (uncompressed bytes).

Both of those are consistent with the observations, so I believe the maximum compressed block size from any of these implementations is roughly 1 MB. Random access at block boundaries still seems reasonable to me at that size. (And zopfli is so incredibly slow that surely nobody would use it for the multi-gigabyte files where random access is particularly useful.)

philiptaylor avatar Mar 18 '25 00:03 philiptaylor