python-zstandard decompressobj inefficiency and work-around, patch included.

Greetings!

I'm working with a network protocol that is essentially a zstandard-compressed stream of newline-delimited lines of text. A Twisted protocol receiving this data is handed slices of compressed data that can span frames. It's simple to feed chunks of data to decompressobj.decompress until it yields uncompressed data, but it's not clear how to determine how much trailing input didn't contribute to that output and should instead be processed by a new decompressobj instance. The only solution I've found with python-zstandard 0.8.1 is to feed data to decompressobj.decompress one byte at a time and roll over to a new decompressobj each time uncompressed data is produced. This is pretty slow. If I've missed something, I'd appreciate advice.

Meanwhile, I've privately replaced decompressobj.decompress with a new function decompressobj.decompress2 and re-implemented decompressobj.decompress as a trivial wrapper around this new function. The new function returns a tuple consisting of (1) the uncompressed result and (2) the value of input.pos before the final call to ZSTD_decompressStream. Using this interface, my protocol's dataReceived method looks like this:

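    def dataReceived(self, data):
        # decompress2 is the patched method: it returns the output plus an
        # offset into data (input.pos before the final ZSTD_decompressStream call).
        decompressed, remaining = self.__dobj.decompress2(data)
        if len(decompressed) > 0:
            self.__json_receiver.dataReceived(decompressed)
            # Output was produced, so start a fresh decompressobj.
            self.__dobj = self.__dctx.decompressobj()
            if remaining > 0:
                # Re-feed the input from that offset to the new object.
                self.dataReceived(data[remaining:])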

If I haven't missed something in the current python-zstandard API, would you consider a solution like mine for inclusion in the next release? I'm attaching the delta to decompressobj.c for reference.

Thanks!

decompressobj.c.diff.txt

Aug 23 '17 14:08 c-wicklein

Possibly fixed a problem I encountered in testing this patch while trying to get it working with pypy...

decompressobj.c.diff.txt

Aug 23 '17 19:08 c-wicklein

That's an interesting use case. So basically the source consists of multiple zstandard frames without any extra "framing" indicating where each zstandard frame begins and ends. As a result, you want an API where you can pass data into the decompressor incrementally and then know where in the input stream you left off so you can start a fresh decompression operation at the beginning of the next frame.

This seems like a legitimate use case. I also consider it an API bug in python-zstandard that it exposes neither when an input was partially consumed nor whether a frame boundary occurred.

We can't change the API of decompressobj.decompress() because it needs to conform to the API of Python's standard library modules. But we could expose a decompress2() (or similar) that allows the caller to know about offsets and/or frame boundaries.
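For comparison, here is a minimal sketch of how the standard library convention handles the same situation, using zlib rather than zstandard. A zlib decompression object sets `eof` when a stream ends and keeps any input past that point in `unused_data`, which is the kind of information this issue asks for. The names `StreamSplitter` and `on_frame` are hypothetical, made up for this illustration; nothing in this thread establishes that python-zstandard's decompressobj exposes the same attributes.

```python
import zlib


class StreamSplitter:
    """Decompress back-to-back zlib streams that arrive in arbitrary chunks."""

    def __init__(self, on_frame):
        self._on_frame = on_frame  # hypothetical callback, called once per stream
        self._dobj = zlib.decompressobj()
        self._parts = []  # output collected for the stream currently open

    def feed(self, chunk):
        while chunk:
            self._parts.append(self._dobj.decompress(chunk))
            if not self._dobj.eof:
                return  # stream still open; wait for more input
            # The stream ended inside this chunk. Whatever followed it was
            # not consumed and is available as unused_data.
            chunk = self._dobj.unused_data
            self._on_frame(b"".join(self._parts))
            self._parts = []
            self._dobj = zlib.decompressobj()


# Two compressed streams concatenated with no extra framing, delivered in
# 5-byte slices so that the boundary falls in the middle of a slice.
frames = []
splitter = StreamSplitter(frames.append)
wire = zlib.compress(b"first\n") + zlib.compress(b"second\n")
for i in range(0, len(wire), 5):
    splitter.feed(wire[i:i + 5])
```

The loop replaces the recursion in the original dataReceived: leftover input is handed to a fresh decompression object until the chunk is exhausted, so no byte-at-a-time feeding is needed.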

I'll look at incorporating your patch (or a variation thereof) next time I sit down to hack on python-zstandard. I'm not sure when that will be. If it is a high priority for you, let me know.

Aug 29 '17 04:08 indygreg

Yes, that's exactly my use case, and decompress2() was an expedient solution. I'll probably continue to use my hack as-is while watching for an update to python-zstandard which incorporates a more permanent solution.

Thanks!

Aug 30 '17 14:08 c-wicklein

I made a little progress towards this today. See https://github.com/indygreg/python-zstandard/issues/59#issuecomment-464523170.

Feb 17 '19 23:02 indygreg
