
The Unpacker fails to retrieve and unpack all the data while streaming with big data.

Open · MasahiroYasumoto opened this issue 1 year ago · 2 comments

The Unpacker fails to retrieve and unpack all the data while streaming large inputs (e.g. 10 GiB).

td-client-python uses msgpack-python internally to unpack the received data while streaming. https://github.com/treasure-data/td-client-python/blob/1.2.1/tdclient/job_api.py#L220-L244
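For context, that code runs Unpacker in its file-like (pull) mode, where the Unpacker itself calls read() on the response object to refill its buffer. A paraphrased sketch of that pattern, assuming res is a streaming HTTP response exposing read() (this is not the exact td-client-python code):

import msgpack

# Pull mode: Unpacker calls res.read(n) internally to refill its buffer,
# so any short or file-unlike read() behavior affects unpacking.
unpacker = msgpack.Unpacker(res, raw=False)
for row in unpacker:
    yield row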

When the data is 10 GiB or larger, I occasionally find that the Unpacker fails to retrieve and unpack all of it while streaming, which results in premature termination without raising an error.

As a workaround, I rewrote the code as follows to first receive all the data, write it to a file, and unpack it from there. This seems to have solved the problem, so I suspect a bug in Unpacker's handling of streaming input.

with open("temp.mpack", "wb") as output_file:
    for chunk in res.stream(1024*1024*1024):
        if chunk:
            output_file.write(chunk)

with open("temp.mpack", "rb") as input_file:
    unpacker = msgpack.Unpacker(input_file, raw=False)
    for row in unpacker:
        yield row

MasahiroYasumoto · Jan 10 '24

The fact that Unpacker can handle the file means Unpacker can handle more than 10 GiB of data. Without a reproducer, I cannot fix your issue.

Maybe the res object in your code has some file-unlike behavior (I don't know what self.get() and res are in your code). I recommend using the Unpacker.feed() method; it frees you from "file-like" edge cases.

https://github.com/msgpack/msgpack-python/blob/140864249fd0f67dffaeceeb168ffe9cdf6f1964/msgpack/_unpacker.pyx#L291-L300
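For illustration, a minimal sketch of the feed-based (push) pattern, reusing res.stream() and the chunked loop from the workaround above (the chunk size here is arbitrary):

import msgpack

# Push mode: we read chunks ourselves and hand the raw bytes to the Unpacker;
# each pass over the iterator yields every object that is complete so far.
unpacker = msgpack.Unpacker(raw=False)
for chunk in res.stream(1024 * 1024):
    if chunk:
        unpacker.feed(chunk)
        for row in unpacker:
            yield row

Because feed() only appends bytes to an internal buffer, the Unpacker never calls read() on res itself, so file-like edge cases in the response object cannot cause it to stop early.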

methane · Jan 10 '24

Thank you for your quick response! I'll try Unpacker.feed() and see if it can fix the problem.

MasahiroYasumoto · Jan 11 '24