python-snappy
Use Rust implementation instead?
Hi there!
I was curious whether you'd be open to using the Rust implementation of snappy instead of the C dependency, which can cause trouble, especially when using python-snappy in environments like AWS Lambda.
I've made cramjam, which does just this, and would be willing to attempt migrating this project in a similar way. It wouldn't require any system dependencies, and packaging it in a wheel (OSX, Linux & Windows supported) would, of course, mean the user doesn't need a compiler. As it stands, cramjam, which includes snappy, results in Linux wheels of about ~1.5MB.
Anyway, let me know what you think and I'd be willing to start messing around with it. :+1:
xref: https://github.com/dask/fastparquet/pull/488
I think that in general this is a good idea. To get a good response here, it would need to show that:
- all the tests can pass, including with the framing format (i.e., files)
- that the performance is equivalent to current or better
- that indeed the install size is not large and the build process simple. Note that many will be installing using conda, where the size of the compiled binary snappy is <100kb ( https://anaconda.org/conda-forge/snappy/files )
- that you can build on all platforms
Seems reasonable.
I see that snappy_formats.py lists hadoop_snappy and framed as available formats. In my light reading of the snappy framing format, I can't find anything that speaks to a hadoop specification.
To my understanding, there is the raw format, used for streaming, and the framed format (entire in-memory streams, like you mentioned). Can I assume the hadoop format reference is a reference to the raw format? Those are the only two formats in the Rust implementation. If this isn't the case, then I guess there is no point in starting.
Also, could you specify what is considered a "large" install size? Is ~1MB too big?
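For what it's worth, my (possibly incomplete) understanding is that hadoop_snappy is neither the raw format nor the official framed format: Hadoop's SnappyCodec wraps raw snappy blocks with its own big-endian length prefixes. A minimal sketch, assuming that layout (the function name and the exact field order here are my assumption, not anything from python-snappy or cramjam):

```python
import struct

def hadoop_frame(raw_block: bytes, uncompressed_len: int) -> bytes:
    # Assumed Hadoop-style framing: a 4-byte big-endian uncompressed
    # length, then a 4-byte big-endian compressed length, then the raw
    # snappy block itself. Note there is no "sNaPpY" stream identifier,
    # which is what distinguishes this from the official framed format.
    return struct.pack(">II", uncompressed_len, len(raw_block)) + raw_block

# Raw snappy encoding of b'hi, hello there': a one-byte varint for the
# uncompressed length (0x0f = 15), then a literal tag byte (0x38 encodes
# a literal of length 15), then the literal bytes.
raw_block = b"\x0f8hi, hello there"
framed = hadoop_frame(raw_block, 15)
```

If that layout is right, then hadoop_snappy support would just be a thin wrapper over the raw codec rather than a third compression format.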
Even though I may be a maintainer here, I don't actually follow the snappy specs... So long as the existing de/compress functions and their stream counterparts still produce identical output, I would be happy!
Hi there, working from home has left me with less time than expected.
I've added a new commit to cramjam which supports framed and raw use of snappy compression. So if I make a new release I can confirm that it will match what python-snappy currently does for its use of compress and stream_compress.
```python
>>> import io
>>> import snappy
>>> import cramjam
>>> data = b'hi, hello there'
>>> raw = io.BytesIO(data)
>>> output = io.BytesIO()
>>> snappy.stream_compress(raw, output)
>>> output.seek(0)
0
>>> output.read()
b'\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there'
>>> cramjam.snappy_compress(data)
b'\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there'
>>> snappy.compress(data)
b'\x0f8hi, hello there'
>>> cramjam.snappy_compress_raw(data)
b'\x0f8hi, hello there'
```
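To make those two byte layouts concrete, here's a pure-Python walk through the outputs above (a sketch based on my reading of the snappy framing and raw format descriptions; no third-party packages needed):

```python
# Framed output from the session above.
framed = b'\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there'

# Stream identifier chunk: type 0xff, 3-byte little-endian length (6),
# then the magic bytes "sNaPpY".
assert framed[0] == 0xff
ident_len = int.from_bytes(framed[1:4], "little")   # 6
ident = framed[4:10]                                # b"sNaPpY"

# Next chunk: type 0x01 is an uncompressed-data chunk. Its length
# (0x13 = 19) covers a 4-byte CRC plus the 15 payload bytes.
chunk_type = framed[10]                             # 1
chunk_len = int.from_bytes(framed[11:14], "little") # 19
crc = framed[14:18]
payload = framed[18:18 + chunk_len - 4]             # b'hi, hello there'

# Raw output from the session above: a one-byte varint with the
# uncompressed length (0x0f = 15), then a literal tag byte 0x38
# encoding a 15-byte literal, then the bytes themselves.
raw = b'\x0f8hi, hello there'
uncompressed_len = raw[0]                           # 15
literal_len = (raw[1] >> 2) + 1                     # 15
literal = raw[2:]                                   # b'hi, hello there'
```

This is why the framed output is exactly the raw-style literal plus framing overhead here: the input is too short to compress, so both codecs store it as a literal.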
One of my concerns with making a PR to python-snappy is that it would remove a lot of existing code, and there are some bits in here, like https://github.com/andrix/python-snappy/blob/602e9c10d743f71bef0bac5e4c4dffa17340d7b3/snappy/snappy.py#L67, which, to be honest, I don't know what they do :sweat_smile: nor how to maintain the existing UncompressError API and the situations in which it should be raised.