Clarifications about streaming support
Hi, I'm developing a streaming .ZIP library that uses miniz_oxide for decompressing DEFLATE. Streaming is hard because you get the data in smaller chunks and want to decompress incrementally. Because of some peculiarities of the .ZIP format, you don't sometimes even know the size of the data beforehands. Here's my library (WIP): https://github.com/golddranks/stream_zipper
I've managed to get decompression to work when parsing files as one chunk. However, incremental decompression is giving me trouble. I'd like to ask some questions which I'd hope get eventually clarified in documentation:
-
What's the difference between
FailedCannotMakeProgressandNeedsMoreInputstates? If I handminiz_oxidea partial buffer (trygit cloningmy repo and runningcargo test test_partial_decompression -- --nocaptureand see the debug printing; it gives 400 bytes instead of the full buffer), I getFailedCannotMakeProgress. -
If I want to continue incrementally, what parameters am I supposed to pass? At the moment, I pass the rest of the input buffer (bytes after 400 EDIT: After making sure that it has indeed consumed 400 bytes up to that point!) and make a fresh output buffer, plus reuse the same
DecompressorOxidestruct, but it returns aFailure. If I pass the input buffer in one go, it manages to decompress the thing. (Edit: Is it already in a failed state afterFailedCannotMakeProgressso that's why it returnsFailure?) -
If I understand right, DEFLATE stream format has a marker when the stream ends, so knowing the size beforehands isn't required. Is this correct?
Hi!
FailedCannotMakeProgressmeans there is no way input data is valid (pending any bugs). Callingdecompressagain will continue to yield this error untilinitis called.NeedsMoreInputmeans that input stopped in the middle of the stream and subsequentdecompresscalls are needed.- Parameters seem fine, but you actually do not reuse
DecompressorOxideas there isinitevery call ofdecompress_to_vecand it is called fromdeflate. This means thatDecompressorOxidelooses all the in-between state and interpretsinput[400..]as starting of another stream at the second call. - Yes.
Hope this helps.
Hi, thanks for the answer.
First of all, about 2. indeed, that was an oversight! Silly me.
About 1. So, I'm getting this error when handing a partial buffer to the library – even when calling for the first time, so DecompressorOxide is inited correctly. If I hand the whole buffer, it manages to decompress it without problems. Is miniz_oxide supposed to be able to handle data that is truncated at an arbitrary place? Either I've got that wrong, or then I found a bug, I guess?
Actually I don't quite understand this explanation: "More input data was expected, but the caller indicated that there was more data, so the input stream is likely truncated" How is the caller indicating that there was more data? If the input stream is truncated, how is this different from just NeedsMoreInput error?
Ah! I got it finally! One is supposed to pass the TINFL_FLAG_HAS_MORE_INPUT flag in the case of incremental decompression, right?
Is miniz_oxide supposed to be able to handle data that is truncated at an arbitrary place?
Yes, it is able to handle input even byte-by-byte.
I completely forgot about that. There is a flag inflate_flags::TINFL_FLAG_HAS_MORE_INPUT that needs to be enabled when not all of the input was given. If it is not set, decompress assumes invalid data. There certainly is a need for better docs.
It seems that we posted at the same time :D I'll see if I can send some documentation PR's later today!
The first call now always succeeds with NeedsMoreInput, but the second one always ends with a failure. I made two a bit more streamlined tests test_decompression_full and test_decompression_partial here: https://github.com/golddranks/stream_zipper/blob/master/src/deflate.rs#L114
The non-partial tests succeeds whereas the partial one fails.
Can you say if I'm using the API somehow wrong?
Output buffer that is given to decompress should have the old decompressed output data to do the matches. It must span at least TINFL_LZ_DICT_SIZE back or the amount of already decompressed data, that's why it is a Cursor and not just a slice; to indicate the place where decompressor left off.
If the stream is very large it is a good idea to use wrapping buffer to avoid excessive copying.
Ah, that explains it. Thanks again.
Yup, we need to do some work on the documentation, the API is quite low-level so it can be a bit difficult to figure it out.
It may also be helpful to look at how the library is used by flate2, which wraps miniz_oxide (though the original miniz is used by default) in reader and writer structs as miniz_oxide is originally a rust port of miniz to get rid of the use of c code in flate2.