Supporting out-of-band buffers with pickle protocol 5

Open jakirkham opened this issue 5 years ago • 3 comments

Feature description

Typically, pickling in Python creates one large bytes object with types, functions, and data all packed in to allow easy reconstruction later. Originally pickling was focused on reading from and writing to disk. These days, however, it is increasingly used as a serialization protocol for sending objects over the wire. In that case the copies of data required to put everything into a single bytes object hurt performance and don't offer much in return, as the data could be shipped along in separate buffers without copying.

For these reasons, Python added support for out-of-band buffers in pickle, which allows the user to flag buffers of data for pickle to extract and send alongside the usual bytes object (thus avoiding unneeded copying of data). This was submitted and accepted as PEP 574 and is part of Python 3.8 (along with a backport package for Python 3.5, 3.6, and 3.7). On the implementation side this just comes down to implementing __reduce_ex__ instead of __reduce__ (basically the same, but with a protocol version argument) and wrapping any bytes-like data (like NumPy arrays and memoryviews) in PickleBuffer objects. For older pickle protocols this step can simply be skipped. Here's an example. The rest is up to libraries that use protocol 5 (like Dask) to implement and use.
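To make the __reduce_ex__ / PickleBuffer pattern concrete, here is a minimal sketch using only the standard library (Python 3.8+). The `Blob` class and the `_rebuild_blob` helper are hypothetical names made up for illustration; nothing here is spaCy's actual implementation.

```python
import pickle
from pickle import PickleBuffer


def _rebuild_blob(data):
    # `data` may be bytes, a bytearray, or an out-of-band buffer object.
    blob = Blob.__new__(Blob)
    with memoryview(data) as view:
        base = view.obj  # the object that actually owns the memory
    # Reuse a bytearray as-is; copy anything else (e.g. read-only bytes).
    blob.data = base if isinstance(base, bytearray) else bytearray(base)
    return blob


class Blob:
    """Hypothetical object that owns one large, contiguous buffer."""

    def __init__(self, data):
        self.data = bytearray(data)

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Flag the buffer so pickle may hand it over out-of-band.
            return _rebuild_blob, (PickleBuffer(self.data),)
        # Older protocols: skip the PickleBuffer step and embed a copy.
        return _rebuild_blob, (bytes(self.data),)


blob = Blob(b"x" * 1_000_000)

# The sender collects the flagged buffers instead of copying them
# into the pickle stream.
buffers = []
payload = pickle.dumps(blob, protocol=5, buffer_callback=buffers.append)

# The receiver gets the small payload plus the buffers, in the same order.
restored = pickle.loads(payload, buffers=buffers)
```

With `buffer_callback` set, the payload only carries the reconstruction metadata and the megabyte of data travels separately in `buffers`. Without it (or with protocol 4 and below), the same class still pickles, just with the data embedded in the stream.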

Could the feature be a custom component or spaCy plugin?

If so, we will tag it as project idea so other users can take it on.


I don't think so as this relies on changing the pickle implementations of spaCy objects. Though I could be wrong :)

jakirkham · May 21 '20 02:05

Should add this would only be needed on objects that have data that could be better handled out-of-band. Objects that don't own data directly themselves wouldn't need this. Also NumPy arrays already support this behavior.

jakirkham · May 21 '20 02:05

Oh thanks for explaining this! I didn't know about it. I've definitely been frustrated by Pickle before.

I think there should be a way to do this cleverly if we add support in preshed as well. I'm very keen to have this project move forward but I don't have bandwidth for it myself. I'd love for someone to take this on.

honnibal · May 21 '20 08:05

Of course! It's a pretty new feature and maybe not as widely known. Know the feeling. Out-of-band pickling should help.

There are some clever ways to make the change simpler still. For example since NumPy arrays already support out-of-band pickling, if __reduce__ or __getstate__ methods return NumPy arrays (or can be tweaked to do so), things mostly just work. We had this observation with Pandas recently ( https://github.com/pandas-dev/pandas/issues/34244 ).
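A rough sketch of why this "mostly just works": pickle recurses into whatever __getstate__ returns, so any nested object that already supports protocol 5 gets its buffer pulled out-of-band without the owner doing anything special. To keep the sketch dependency-free, the `Array` class below is a hypothetical stand-in for a NumPy array, and `Vectors` is a made-up owner class, not a real spaCy type.

```python
import pickle
from pickle import PickleBuffer


def _rebuild_array(data):
    arr = Array.__new__(Array)
    with memoryview(data) as view:
        base = view.obj
    arr.data = base if isinstance(base, bytearray) else bytearray(base)
    return arr


class Array:
    """Stand-in for a NumPy array: already supports out-of-band pickling."""

    def __init__(self, data):
        self.data = bytearray(data)

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            return _rebuild_array, (PickleBuffer(self.data),)
        return _rebuild_array, (bytes(self.data),)


class Vectors:
    """Hypothetical owner object with plain __getstate__/__setstate__."""

    def __init__(self, name, values):
        self.name = name
        self._values = Array(values)

    def __getstate__(self):
        # Return the array object itself rather than converting it to bytes.
        return {"name": self.name, "values": self._values}

    def __setstate__(self, state):
        self.name = state["name"]
        self._values = state["values"]


vectors = Vectors("demo", b"\x00" * 500_000)

buffers = []
payload = pickle.dumps(vectors, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(payload, buffers=buffers)
```

`Vectors` never mentions PickleBuffer or the protocol version. The design point is that state methods which hand back array objects (instead of calling something like `tobytes()` on them) leave the out-of-band handling to the array type.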

jakirkham · May 21 '20 09:05