Use Rust implementation instead?

Open milesgranger opened this issue 5 years ago • 4 comments

Hi there!

I was curious if you'd be open to using the Rust implementation of snappy instead of the C dependency, which can cause trouble, especially when using python-snappy in environments like AWS Lambda?

I've made cramjam, which does just this, but I would be willing to attempt migrating this project in a similar way. It wouldn't require any system dependencies, and since everything would be packed into a wheel (OSX, Linux & Windows supported), the user wouldn't need a compiler either. As it is right now, cramjam, which includes snappy, comes to about 1.5MB for Linux wheels.

Anyway, let me know what you think and I'd be willing to start messing around with it. :+1:

milesgranger avatar Mar 16 '20 06:03 milesgranger

xref: https://github.com/dask/fastparquet/pull/488

I think that in general this is a good idea. To get a good response here, it would need to show that

  • all the tests pass, including with the framing format (i.e., files)
  • the performance is equivalent to the current implementation or better
  • the install size is indeed not large and the build process is simple. Note that many will be installing using conda, where the size of the compiled binary snappy is <100kb ( https://anaconda.org/conda-forge/snappy/files )
  • you can build on all platforms

martindurant avatar Mar 16 '20 13:03 martindurant

Seems reasonable.

I see that snappy_formats.py lists hadoop_snappy and framed as available formats. From my light reading of the snappy framing format, I can't find anything that speaks to a hadoop specification.

To my understanding, there is the raw, used for streaming, and the framed (entire in-memory streams like you mentioned) formats of snappy. Can I assume the hadoop format reference is a reference to the raw format? Those are the only two formats in the Rust implementation. If this isn't the case, then I guess there is no point in starting.

Also, could you specify what is considered a "large" install size? Is ~1MB too big?

milesgranger avatar Mar 16 '20 16:03 milesgranger

Even though I may be a maintainer here, I don't actually follow the snappy specs... So long as the existing de/compress functions and their stream counterparts still produce identical output, I would be happy!
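
For illustration, an "identical output" check could look roughly like the sketch below. It is only a sketch: `find_mismatches` and the sample set are hypothetical names made up here, and `zlib.compress` stands in for both implementations so that it runs without snappy or any Rust-backed package installed. In a real comparison, the two libraries' compress functions would be passed in instead.

```python
import os
import zlib  # stand-in codec; not part of python-snappy


def find_mismatches(reference, candidate, samples):
    """Return the samples for which the two compress callables disagree."""
    return [s for s in samples if reference(s) != candidate(s)]


# Hypothetical sample set: empty input, a short literal,
# highly repetitive data, and incompressible random data.
samples = [b"", b"hi, hello there", b"a" * 10_000, os.urandom(4096)]

# Both arguments are the zlib stand-in here, so no mismatch is expected.
mismatches = find_mismatches(zlib.compress, zlib.compress, samples)
print(mismatches)
```

The same helper could be reused for the decompress direction and for the stream functions by wrapping them so that they take and return bytes.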

martindurant avatar Mar 16 '20 16:03 martindurant

Hi there, working from home has left me with less time than expected.

I've added a new commit to cramjam which supports framed and raw use of snappy compression. So if I make a new release I can confirm that it will match what python-snappy currently does for its use of compress and stream_compress.

>>> import io
>>> import snappy
>>> import cramjam
>>> data = b'hi, hello there'
>>> raw = io.BytesIO(data)
>>> output = io.BytesIO()
>>> snappy.stream_compress(raw, output)
>>> output.seek(0)
0
>>> output.read()
b'\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there'
>>> cramjam.snappy_compress(data)
b'\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there'
>>> snappy.compress(data)
b'\x0f8hi, hello there'
>>> cramjam.snappy_compress_raw(data)
b'\x0f8hi, hello there'
>>> 
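
For anyone wondering what those bytes mean, here is a rough stdlib-only sketch that takes the two outputs above apart. It assumes the layout described in the snappy format documents: framed streams are a sequence of chunks with a 1-byte type and a 3-byte little-endian length, data chunks start with a 4-byte checksum, and a raw block starts with a varint length followed by tagged elements. `split_frames` and `read_short_literal` are hypothetical helper names, not part of snappy or cramjam, and the raw helper only handles the single short literal case seen here.

```python
def split_frames(stream: bytes):
    """Yield (chunk_type, payload) pairs from a framing-format stream."""
    pos = 0
    while pos < len(stream):
        chunk_type = stream[pos]
        # Chunk length is stored as 3 bytes, little-endian.
        length = int.from_bytes(stream[pos + 1:pos + 4], "little")
        yield chunk_type, stream[pos + 4:pos + 4 + length]
        pos += 4 + length


def read_short_literal(block: bytes) -> bytes:
    """Decode a raw block made of one literal of at most 60 bytes."""
    expected_len = block[0]
    assert expected_len < 0x80, "multi-byte varint preamble not handled"
    tag = block[1]
    assert tag & 0b11 == 0, "element is not a literal"
    assert (tag >> 2) < 60, "long literal lengths not handled"
    literal_len = (tag >> 2) + 1  # upper six bits hold length - 1
    assert literal_len == expected_len
    return block[2:2 + literal_len]


framed = b'\xff\x06\x00\x00sNaPpY\x01\x13\x00\x00\x82\x8f\x01\xb8hi, hello there'
raw = b'\x0f8hi, hello there'

chunks = list(split_frames(framed))
# First chunk: stream identifier (type 0xff) carrying b'sNaPpY'.
# Second chunk: uncompressed data (type 0x01), 4 checksum bytes, then the input.
print(chunks[0], chunks[1][1][4:])
print(read_short_literal(raw))
```

So in the framed output the input is stored uncompressed behind a stream identifier and a checksum, while the raw output is just the length byte, one literal tag, and the input itself.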

One of my concerns with making a PR to python-snappy is that it will remove a lot of existing code. There are some bits in here, like https://github.com/andrix/python-snappy/blob/602e9c10d743f71bef0bac5e4c4dffa17340d7b3/snappy/snappy.py#L67, where, to be honest, I don't know what it does :sweat_smile:. I'm also not sure how to maintain the existing UncompressError API, or in which situations it should be raised.

milesgranger avatar Apr 02 '20 08:04 milesgranger