miniz_oxide icon indicating copy to clipboard operation
miniz_oxide copied to clipboard

Clarifications about streaming support

Open golddranks opened this issue 7 years ago • 11 comments

Hi, I'm developing a streaming .ZIP library that uses miniz_oxide for decompressing DEFLATE. Streaming is hard because you get the data in smaller chunks and want to decompress incrementally. Because of some peculiarities of the .ZIP format, you don't sometimes even know the size of the data beforehands. Here's my library (WIP): https://github.com/golddranks/stream_zipper

I've managed to get decompression to work when parsing files as one chunk. However, incremental decompression is giving me trouble. I'd like to ask some questions which I'd hope get eventually clarified in documentation:

  1. What's the difference between FailedCannotMakeProgress and NeedsMoreInput states? If I hand miniz_oxide a partial buffer (try git cloning my repo and running cargo test test_partial_decompression -- --nocapture and see the debug printing; it gives 400 bytes instead of the full buffer), I get FailedCannotMakeProgress.

  2. If I want to continue incrementally, what parameters am I supposed to pass? At the moment, I pass the rest of the input buffer (bytes after 400 EDIT: After making sure that it has indeed consumed 400 bytes up to that point!) and make a fresh output buffer, plus reuse the same DecompressorOxide struct, but it returns a Failure. If I pass the input buffer in one go, it manages to decompress the thing. (Edit: Is it already in a failed state after FailedCannotMakeProgress so that's why it returns Failure?)

  3. If I understand right, DEFLATE stream format has a marker when the stream ends, so knowing the size beforehands isn't required. Is this correct?

golddranks avatar Apr 13 '18 03:04 golddranks

Hi!

  1. FailedCannotMakeProgress means there is no way input data is valid (pending any bugs). Calling decompress again will continue to yield this error until init is called. NeedsMoreInput means that input stopped in the middle of the stream and subsequent decompress calls are needed.
  2. Parameters seem fine, but you actually do not reuse DecompressorOxide as there is init every call of decompress_to_vec and it is called from deflate. This means that DecompressorOxide looses all the in-between state and interprets input[400..] as starting of another stream at the second call.
  3. Yes.

Hope this helps.

Frommi avatar Apr 13 '18 05:04 Frommi

Hi, thanks for the answer.

First of all, about 2. indeed, that was an oversight! Silly me.

About 1. So, I'm getting this error when handing a partial buffer to the library – even when calling for the first time, so DecompressorOxide is inited correctly. If I hand the whole buffer, it manages to decompress it without problems. Is miniz_oxide supposed to be able to handle data that is truncated at an arbitrary place? Either I've got that wrong, or then I found a bug, I guess?

golddranks avatar Apr 13 '18 07:04 golddranks

Actually I don't quite understand this explanation: "More input data was expected, but the caller indicated that there was more data, so the input stream is likely truncated" How is the caller indicating that there was more data? If the input stream is truncated, how is this different from just NeedsMoreInput error?

golddranks avatar Apr 13 '18 07:04 golddranks

Ah! I got it finally! One is supposed to pass the TINFL_FLAG_HAS_MORE_INPUT flag in the case of incremental decompression, right?

golddranks avatar Apr 13 '18 08:04 golddranks

Is miniz_oxide supposed to be able to handle data that is truncated at an arbitrary place?

Yes, it is able to handle input even byte-by-byte.

I completely forgot about that. There is a flag inflate_flags::TINFL_FLAG_HAS_MORE_INPUT that needs to be enabled when not all of the input was given. If it is not set, decompress assumes invalid data. There certainly is a need for better docs.

Frommi avatar Apr 13 '18 08:04 Frommi

It seems that we posted at the same time :D I'll see if I can send some documentation PR's later today!

golddranks avatar Apr 13 '18 08:04 golddranks

The first call now always succeeds with NeedsMoreInput, but the second one always ends with a failure. I made two a bit more streamlined tests test_decompression_full and test_decompression_partial here: https://github.com/golddranks/stream_zipper/blob/master/src/deflate.rs#L114

The non-partial tests succeeds whereas the partial one fails.

Can you say if I'm using the API somehow wrong?

golddranks avatar Apr 13 '18 09:04 golddranks

Output buffer that is given to decompress should have the old decompressed output data to do the matches. It must span at least TINFL_LZ_DICT_SIZE back or the amount of already decompressed data, that's why it is a Cursor and not just a slice; to indicate the place where decompressor left off.

If the stream is very large it is a good idea to use wrapping buffer to avoid excessive copying.

Frommi avatar Apr 13 '18 09:04 Frommi

Ah, that explains it. Thanks again.

golddranks avatar Apr 13 '18 10:04 golddranks

Yup, we need to do some work on the documentation, the API is quite low-level so it can be a bit difficult to figure it out.

It may also be helpful to look at how the library is used by flate2, which wraps miniz_oxide (though the original miniz is used by default) in reader and writer structs as miniz_oxide is originally a rust port of miniz to get rid of the use of c code in flate2.

oyvindln avatar Apr 13 '18 13:04 oyvindln