python-snappy Use Rust implementation instead?

Hi there!

I was curious whether you'd be open to using the Rust implementation of snappy instead of the C dependency, which can cause trouble, especially when using python-snappy in environments like AWS Lambda.

I've made cramjam, which does just this, but I would be willing to attempt migrating this project in a similar way. It wouldn't require any system dependencies, and with everything packed into a wheel (OSX, Linux & Windows supported), users wouldn't need a compiler. As it is right now, cramjam, which includes snappy, comes to about 1.5MB for Linux wheels.

Anyway, let me know what you think and I'd be willing to start messing around with it. :+1:

Mar 16 '20 06:03 milesgranger

xref: https://github.com/dask/fastparquet/pull/488

I think that in general this is a good idea. To get a good response here, it would need to show that

- all the tests can pass, including with the framing format (i.e., files)
- that the performance is equivalent to current or better
- that indeed the install size is not large and the build process simple. Note that many will be installing using conda, where the size of the compiled binary snappy is <100kb ( https://anaconda.org/conda-forge/snappy/files )
- that you can build on all platforms
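
For the performance point, a minimal timing sketch might look like the following. It is only an illustration: `time_compressor`, `current_impl` and `candidate_impl` are hypothetical names, and zlib at two compression levels stands in for the C-backed and Rust-backed snappy functions so that the snippet runs with the standard library alone.

```python
import timeit
import zlib


def time_compressor(compress, payload, repeat=5, number=200):
    """Return the best observed per-call time (seconds) for compress(payload)."""
    timings = timeit.repeat(lambda: compress(payload), repeat=repeat, number=number)
    return min(timings) / number


# Stand-ins only (assumption): swap these for the current and the
# candidate snappy compress functions when both are installed.
def current_impl(data):
    return zlib.compress(data, 1)


def candidate_impl(data):
    return zlib.compress(data, 6)


payload = b"hi, hello there" * 1000
baseline = time_compressor(current_impl, payload)
candidate = time_compressor(candidate_impl, payload)
print(f"current: {baseline:.2e}s  candidate: {candidate:.2e}s  "
      f"ratio: {candidate / baseline:.2f}")
```

Taking the minimum over several repeats reduces noise from other processes; a real comparison would also want payloads of different sizes and compressibility.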

Mar 16 '20 13:03 martindurant

Seems reasonable.

I see that snappy_formats.py lists hadoop_snappy and framed as available formats. In my light reading of the snappy framing format, I can't find anything that speaks to a hadoop specification.

To my understanding, there is the raw, used for streaming, and the framed (entire in-memory streams like you mentioned) formats of snappy. Can I assume the hadoop format reference is a reference to the raw format? Those are the only two formats in the Rust implementation. If this isn't the case, then I guess there is no point in starting.

Also, could you specify what is considered a "large" install size? Is ~1MB too big?

Mar 16 '20 16:03 milesgranger

Even though I may be a maintainer here, I don't actually follow the snappy specs... So long as the existing de/compress functions and their stream counterparts still produce identical output, I would be happy!
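
One way the "identical output" requirement could be checked is sketched below. The helper names (`find_mismatches`, `stream_adapter`) are hypothetical, and zlib stands in for both implementations so the snippet runs without either snappy package installed; in practice the reference and candidate callables would be the existing functions and their replacements.

```python
import io
import zlib


def find_mismatches(reference, candidate, samples):
    """Return the samples for which the two bytes -> bytes callables disagree."""
    return [s for s in samples if reference(s) != candidate(s)]


def stream_adapter(stream_fn):
    """Wrap a (src, dst) file-object function as a bytes -> bytes callable."""
    def run(data):
        src, dst = io.BytesIO(data), io.BytesIO()
        stream_fn(src, dst)
        return dst.getvalue()
    return run


# Stand-ins only (assumption): replace with the real compress functions
# and their stream counterparts.
def reference_compress(data):
    return zlib.compress(data, 6)


def candidate_compress(data):
    return zlib.compress(data, 6)


def reference_stream_compress(src, dst):
    dst.write(zlib.compress(src.read(), 6))


samples = [b"", b"hi, hello there", b"a" * 100_000, bytes(range(256)) * 64]
print(find_mismatches(reference_compress, candidate_compress, samples))
```

Wrapping the stream functions with `stream_adapter` lets the same comparison cover both the in-memory and the file-object entry points.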

Mar 16 '20 16:03 martindurant

Hi there, working from home has left me with less time than expected.

I've added a new commit to cramjam which supports framed and raw use of snappy compression. So if I make a new release I can confirm that it will match what python-snappy currently does for its use of compress and stream_compress.

>>> import io
>>> import snappy
>>> import cramjam
>>> data = b'hi, hello there'
>>> raw = io.BytesIO(data)
>>> output = io.BytesIO()
>>> snappy.stream_compress(raw, output)
>>> output.seek(0)
0
>>> output.read()
b'\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there'
>>> cramjam.snappy_compress(data)
b'\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there'
>>> snappy.compress(data)
b'\x0f8hi, hello there'
>>> cramjam.snappy_compress_raw(data)
b'\x0f8hi, hello there'
>>>
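
To make the difference between the two outputs above easier to see, here is a small stdlib-only sketch that splits the framed bytes into chunks and reads the length preamble of the raw block. `split_frames` and `read_raw_length` are hypothetical helper names, not part of either library; the layout assumed is the one in the snappy framing format description (one type byte, a 3-byte little-endian length, then the chunk data) and the raw format's leading varint.

```python
def split_frames(stream):
    """Yield (chunk_type, payload) for each chunk of a framed snappy stream."""
    pos = 0
    while pos < len(stream):
        if pos + 4 > len(stream):
            raise ValueError("truncated chunk header")
        chunk_type = stream[pos]
        length = int.from_bytes(stream[pos + 1:pos + 4], "little")
        payload = stream[pos + 4:pos + 4 + length]
        if len(payload) != length:
            raise ValueError("truncated chunk body")
        yield chunk_type, payload
        pos += 4 + length


def read_raw_length(block):
    """Decode the varint preamble of a raw block.

    Returns (uncompressed_length, bytes_consumed).
    """
    result = 0
    shift = 0
    for index, byte in enumerate(block):
        result |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear marks the last varint byte
            return result, index + 1
        shift += 7
    raise ValueError("unterminated varint")


# Bytes copied from the session above.
framed = b'\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there'
raw = b'\x0f8hi, hello there'

for chunk_type, payload in split_frames(framed):
    print(hex(chunk_type), payload)
print(read_raw_length(raw))
```

The first chunk (type 0xff) is the stream identifier, and the second (type 0x01) is an uncompressed data chunk whose first four bytes are a checksum followed by the original 15 bytes. The raw block starts with the length 15 encoded in a single varint byte.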

One of my concerns with making a PR to python-snappy is that it will remove a lot of existing code, and there are some bits in here, like https://github.com/andrix/python-snappy/blob/602e9c10d743f71bef0bac5e4c4dffa17340d7b3/snappy/snappy.py#L67, where, to be honest, I don't know what the code does :sweat_smile:. I'm also unsure how to maintain the existing UncompressError API or the situations in which it should be raised.
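
On the UncompressError question, one possible approach (a sketch under assumptions, not how python-snappy does it) would be to translate whatever exception the new backend raises into the existing exception type at the API boundary. Below, the `UncompressError` class is a local stand-in for the real one, `translate_errors` is a hypothetical decorator, and zlib plays the role of the backend so the example runs on its own.

```python
import functools
import zlib


class UncompressError(Exception):
    """Local stand-in (assumption) for the exception callers already catch."""


def translate_errors(*backend_errors):
    """Re-raise the given backend exception types as UncompressError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except backend_errors as exc:
                # Keep the original exception attached as __cause__.
                raise UncompressError(str(exc)) from exc
        return wrapper
    return decorator


# zlib is only a stand-in backend here; the real wrapper would name the
# exception type(s) raised by the replacement implementation.
@translate_errors(zlib.error)
def uncompress(data):
    return zlib.decompress(data)


print(uncompress(zlib.compress(b"hi, hello there")))
```

With this shape, existing `except UncompressError` blocks in user code keep working, while the backend's original error stays available for debugging.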

Apr 02 '20 08:04 milesgranger
