python-zstandard
python-zstandard produces incomplete data after decompress
I originally filed this against urllib3 in https://github.com/urllib3/urllib3/issues/3008 but the investigation so far is showing that urllib3 is providing the correct data to python-zstandard.
Environment info
OS: Windows-10-10.0.22621-SP0
Python: 3.10.11
urllib3: 2.0.1
pip freeze results:
appdirs==1.4.4
certifi==2022.9.24
charset-normalizer==2.1.1
colorama==0.4.6
flake8==5.0.4
idna==3.4
mccabe==0.7.0
numpy==1.23.4
pycodestyle==2.9.1
pydiffx==1.1
pyflakes==2.5.0
python-gitlab==3.12.0
RBTools==4.0
requests==2.28.1
requests-toolbelt==0.10.1
six==1.16.0
texttable==1.6.7
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==2.0.1
zstandard==0.21.0
Sample code
Example source data to decompress
I have attached a zip of text.txt.zstd because GitHub doesn't support attaching .zstd files. text.txt.zstd contains the raw data passed to ZstdDecoder.decompress(). Extract it from the zip, then run:
import zstandard as zstd

print('Using zstandard backend: %s' % zstd.backend)

with open(r'C:\utils\temp\text.txt.zstd', 'rb') as f:
    data = f.read()

print('decompressed %d into %d bytes' %
      (len(data), len(zstd.decompress(data))))
My output is:
Using zstandard backend: cext
decompressed 2554 into 1048576 bytes
Whereas downloading zstd 1.5.5 and running zstd.exe text.txt.zstd -d -o hi.txt produces 8724469 bytes and a valid text file.
I need to verify this with the supplied sample data, but my knee-jerk guess is that it has something to do with zstd frames. Various APIs in this library stop reading at frame boundaries by default. In hindsight this was a bad default behavior, and the release notes have called out a desire to change it so that multiple frames are read transparently.
Or you found a legit issue where we fail to handle a special case where the number of bytes read falls directly on a power-of-2 boundary. 1048576 bytes decompressed is very suspicious!
Thanks for taking a look at this @indygreg. I've posted a $300 bounty on the upstream issue, since this bug affects data integrity for urllib3 users. If you find the root cause and publish a release with a fix, you can submit a claim for that bounty. Thanks again!
Current theory is that python-zstandard's DecompressionObj_decompress() function is stopping after the first frame. Sample C code that seems to match what that function is doing:
#include <algorithm>
#include <fstream>
#include <iterator>
#include <vector>

#include <zstd.h>

int main() {
    // File read code copied from Stack Overflow. Note that
    // std::ios::binary is required on Windows so the compressed
    // bytes are read unmangled.
    std::ifstream my_file("C:\\utils\\temp\\text.txt.zstd",
                          std::ios::binary);
    my_file.seekg(0, std::ios_base::end);  // Seek to end of file.
    const unsigned int file_length = my_file.tellg();
    my_file.seekg(0);
    std::vector<char> file_data(file_length);
    my_file.read(&file_data[0], file_length);

    ZSTD_DStream* zds = ZSTD_createDStream();
    if (zds) {
        ZSTD_initDStream(zds);
        ZSTD_inBuffer input = {&file_data[0], file_length, 0};
        size_t outSize = ZSTD_DStreamOutSize();
        std::vector<char> out_data(outSize);
        std::vector<char> full_decompressed;
        bool done = false;
        bool error = false;
        while (!done) {
            // Unused, but it returns 1048576 for this sample data.
            auto fcs = ZSTD_getFrameContentSize(input.src, input.size);
            ZSTD_outBuffer output = {&out_data[0], outSize, 0};
            auto result = ZSTD_decompressStream(zds, &output, &input);
            if (ZSTD_isError(result)) {
                done = true;
                error = true;
            } else {
                std::copy(out_data.begin(), out_data.begin() + output.pos,
                          std::back_inserter(full_decompressed));
                if (result == 0) {
                    // result == 0 signals the end of a frame, so this
                    // loop stops even though input.pos < input.size.
                    done = true;
                }
            }
        }
        ZSTD_freeDStream(zds);
    }
    return 0;
}
This results in a buffer of 1048576 bytes because the loop stops once ZSTD_decompressStream() returns 0, which signals the end of a frame. I'm still reading up on the zstd APIs to try to understand what it looks like to move on to another frame.
Yeah, your frames theory seems to be the correct explanation. The sample data has multiple frames, and python-zstandard stops after the first frame. I have two main follow-up questions:
- Is there a way now to use python-zstandard to decode multiple frames, even if the current requirement is one call to decompress() per frame? I can't find any API that tells me where the current frame ended for me to adjust the source bytestring appropriately for another call into decompress().
- Is there any timeframe for when read_across_frames=True will be supported?
Ok so reading through documentation and other older bug reports, it appears that I should be using ZstdCompressor().stream_reader() to get support for reading across frames. I will talk to the urllib3 maintainers about switching to that API.
Sorry, one typo. I meant to say that I should be using ZstdDecompressor().stream_reader().
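For reference, here is a minimal sketch of what that usage might look like against the sample file from this issue (read_across_frames is the flag documented for stream_reader; whether this fully covers the urllib3 case is what I still need to verify):

import zstandard as zstd

with open(r'C:\utils\temp\text.txt.zstd', 'rb') as f:
    reader = zstd.ZstdDecompressor().stream_reader(
        f, read_across_frames=True)
    # read() drains every frame instead of stopping at the first.
    data = reader.read()
print('decompressed %d bytes' % len(data))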
For the ZstdDecompressionObj.decompress() API, I consider the lack of reading across frames a bug. These decompress() APIs are supposed to read as much as possible in a single operation. Its current behavior of stopping at frame boundaries violates that implicit contract.

I plan to change the API to decompress across frames by default. As this is technically an API-breaking change, it will require a new 0.x version instead of a 0.x.y point release.
Actually, reading the docs for ZstdDecompressionObj, we explicitly call out the single-frame reading behavior:

"Each instance is single use: once an input frame is decoded, decompress() can no longer be called."

As the existing behavior was explicitly documented, following prior conventions of this library, I'm not comfortable making a breaking change in the next 0.x release. The best we can do is add an off-by-default flag to allow ZstdDecompressionObj instances to read across frames. Then, two releases from now, we can make it the default.
I think the overhead for calling ZstdDecompressor.decompressobj() should be minimal, although there may be non-negligible cost in calling ZSTD_DCtx_refDDict().

Would urllib3 be comfortable adapting their usage to instantiate multiple ZstdDecompressionObj instances? I.e., how important is it to ship a new zstandard version with support for reading across frames?
We would be very comfortable adapting our usage.
By the way, this is ironic, because urllib3 2.0 broke users that did not read the documentation for 1.x, so we know that nobody reads the documentation. That goes especially for the documentation of ZstdDecompressionObj, which is supposed to work like the standard library. So you likely have other users that are broken without realizing it. But this is your problem to solve, and again, we would be very comfortable adapting our usage.
@grossag Do you have everything you need to fix this bug on the urllib3 side?
Thanks for the response! I have been reading through the zstd and python-zstandard code for a couple of days now and have some questions about the usefulness of APIs that read a single frame.
Overall, I don't understand the use of any API that only supports decoding a single frame, requires the caller to tell it how much to read, and doesn't tell you how much it read. A single frame in and of itself is not useful unless I can call the same API again to read the next frame. So I think there is a really strong argument for the one-shot API reading across frames, because otherwise I can't really see any use for this API: it doesn't tell you how much it read, so it's not like I can advance my position in the source bytes and call the API again to read the next frame.
I would understand if there was a version of ZstdDecompressor.stream_reader() that uses the zstd APIs to query the frame size and reads only that. That function would then consume enough of the iterable to read that frame. It could then be called repeatedly until the iterable is consumed and the data is exhausted. But instead, ZstdDecompressor.stream_reader() requires that the caller pass the read size, even though the caller doesn't have enough information to know how much source data is needed to construct a frame unless it starts parsing the zstd frame header block. And if the caller doesn't pass the read size, it defaults to something unrelated to the actual frame size, potentially consuming more data than is needed for that frame.
It appears that I can use ZstdDecompressionReader to get this to work in urllib3. I don't have tests working entirely yet, but I am close.

If ZstdDecompressionObj supported reading across frames, then I could call ZstdDecompressionObj.decompress() each time urllib3 has a chunk of data and then use the unused_data attribute when we get future chunks. But I don't think urllib3 can use this object, because it doesn't support reading across frames.

The only object I can find that supports reading across frames is ZstdDecompressionReader. I think I am able to use this to decompress multiple frames if I use an intermediate io.BytesIO() to push response data in and have zstd read from that.
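Roughly, a sketch of that intermediate-buffer idea (the chunks list below is a hypothetical stand-in for compressed data arriving from the response):

import io
import zstandard as zstd

# Hypothetical stand-in for compressed chunks arriving from a response.
chunks = [b'...', b'...']

# Accumulate the compressed chunks in an intermediate buffer, then let
# the reader decode across every frame that buffer contains.
buf = io.BytesIO()
for chunk in chunks:
    buf.write(chunk)
buf.seek(0)

reader = zstd.ZstdDecompressor().stream_reader(buf, read_across_frames=True)
decompressed = reader.read()

The obvious cost is the extra copy into the intermediate buffer.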
Overall, I don't understand the use of any API that only supports decoding a single frame, requires the caller to tell it how much to read, and doesn't tell you how much it read.
Part of this is due to bad design in this library. In my defense, when I started this library in 2016, when zstandard was relatively young, you rarely saw zstd payloads with multiple frames. In the years since, zstandard has skyrocketed in popularity and has built out new features, such as adaptive compression. Its APIs have also evolved, making writing multiple frames a bit simpler.
Part of this is also due to bad design of the (de)compression APIs in the Python standard library. The only reason that ZstdCompressionObj and ZstdDecompressionObj exist is to be a drop-in alternative to the zlib, xz, etc. formats in the Python stdlib. The stdlib decompress() is especially poorly designed because it resizes the output buffer to fit the data. This is a massive performance footgun.
The io.RawIOBase interface (Zstd(De)CompressionReader + Zstd(De)CompressionWriter), by contrast, is much more reasonable: it allows the caller to control sizes explicitly and to control/account for I/O + CPU [back]pressure. I really, really, really wish I could make the stdlib decompression APIs go away, because they are just that bad.
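To make that size-control point concrete, here is a quick sketch (the input file name and the consume() callback are hypothetical):

import zstandard as zstd

def consume(chunk: bytes) -> None:
    pass  # hypothetical downstream consumer

with open('payload.zst', 'rb') as src:  # hypothetical input file
    reader = zstd.ZstdDecompressor().stream_reader(src)
    while True:
        # The caller decides exactly how many decompressed bytes to
        # pull per iteration, bounding memory use no matter how large
        # the compressed payload is.
        chunk = reader.read(65536)
        if not chunk:
            break
        consume(chunk)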
Anyway, you aren't the only ones complaining about the lack of multi-frame support. It is clear we need to evolve this library to support a now common requirement in 2023. The goal posts moved and we need to adapt. This will be the major issue I focus on the next time I sit down to hack on this project.
FYI, in urllib3 a PR has been merged using ZstdDecompressionObj.unused_data to decompress each frame separately, one after the other (see the sketch below).

Moving forward, I personally believe that implementing read_across_frames in all methods should do the trick for python-zstandard, but you may also want to peek at what pyzstd is doing under the hood in case they had good ideas there.
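For readers following along, the shape of that unused_data approach is roughly the following (a simplified sketch, not the actual urllib3 code):

import zstandard as zstd

class MultiFrameDecoder:
    """Simplified sketch of the urllib3 approach: one
    ZstdDecompressionObj per frame, chained via unused_data."""

    def __init__(self) -> None:
        self._dctx = zstd.ZstdDecompressor()
        self._obj = self._dctx.decompressobj()

    def decompress(self, data: bytes) -> bytes:
        out = self._obj.decompress(data)
        # When a frame ends mid-chunk, eof flips to True and any bytes
        # after the frame are exposed via unused_data; feed them to a
        # fresh decompression object to start the next frame.
        while self._obj.eof and self._obj.unused_data:
            leftover = self._obj.unused_data
            self._obj = self._dctx.decompressobj()
            out += self._obj.decompress(leftover)
        return out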
Thanks for the response @indygreg! As @Rogdham mentioned, he was able to put together a solution that worked well in the context of the existing urllib3 design. I had put together https://github.com/urllib3/urllib3/pull/3021, but ZstdDecompressionReader wasn't a great fit for urllib3, because urllib3 can receive the compressed data in chunks while I needed to construct the ZstdDecompressionReader with an io.RawIOBase object. I had worked around that by creating an intermediate io.BytesIO buffer, but it resulted in an extra copy that @Rogdham's change avoids. As a result, I thought his change was better for urllib3.
So anyway, it seems like there are still many ways to use your APIs to decode data with multiple frames. unused_data was really important in allowing for this in urllib3.

It would be great to get the one-shot API to support multiple frames, but in the meantime urllib3 has a solution to its issue. Thanks for your support!
I added support for read_across_frames=True to ZstdDecompressionObj in the main branch. If you have time to test and validate with urllib3, any feedback would be appreciated.
I'm really sorry it took me so long to get to this. I gave it a test today by cloning urllib3 and python-zstandard, reverting https://github.com/urllib3/urllib3/commit/aca0f01bb6a29eda24799ec31895f45a1bb9e58b, and adding read_across_frames=True to the decompressobj() call.
Overall I didn't get any truncated data, but urllib3 was broken in a small way and I have a question for you about the expected behavior.
Here is the test urllib3 code:
if zstd is not None:

    class ZstdDecoder(ContentDecoder):
        def __init__(self) -> None:
            self._obj = zstd.ZstdDecompressor().decompressobj(
                read_across_frames=True
            )

        def decompress(self, data: bytes) -> bytes:
            if not data:
                return b""
            return self._obj.decompress(data)  # type: ignore[no-any-return]

        def flush(self) -> bytes:
            ret = self._obj.flush()  # note: this is a no-op
            if not self._obj.eof:
                raise DecodeError("Zstandard data is incomplete")
            return ret  # type: ignore[no-any-return]
In my tests, flush() raises an exception because self._obj.eof is always False. Is this expected with the new read_across_frames=True behavior?
I confirmed that I had no truncation and everything worked correctly if I commented out the two lines in flush() that check eof and raise a DecodeError. So overall it looks good.