beam icon indicating copy to clipboard operation
beam copied to clipboard

Add support for out-of-band pickling (pickle 5) in PickleCoder

Open damccorm opened this issue 2 years ago • 1 comments

dev@ discussion: https://lists.apache.org/thread.html/r266f37640901544927205b913d4903340d6c59c3d94905e5dda0db42%40%3Cdev.beam.apache.org%3E

PickleCoder should support out-of-band pickling to avoid unnecessary memory copies for types that support it (including numpy and pandas types).

Imported from Jira BEAM-12418. Original Jira may contain additional context. Reported by: bhulette.

damccorm avatar Jun 04 '22 20:06 damccorm

I looked into this a little bit. I don't think we can see huge performance wins here because we don't have an out-of-band path for transferring the data (e.g. shared memory). No matter what we're going to need to write/read the buffers over the Fn API in-band with the rest of the encoded object.

However as noted in PEP 574 one can still see an improvement in in-band performance with the pickle 5 protocol. It can eliminate a memcopy on the serialization path because we no longer have to materialize a full byte[] representing the serialized object (copying all the buffers) and then copy that byte[] to the output buffer.

I ran some benchmarks to confirm this, we can see that with pickle5, pickle.dump performs better when writing to a file-like object: image

To take advantage of this I think we'd need to expose the OutputStream as a file-like object, and pass that to pickle.dump. This would allow pickle to write buffers directly to the output stream. Note the current behavior is to execute pickle.dumps (one memcopy) and write the result to the output stream (second memcopy).

CCing a few folks: @robertwb @shoyer @tvalentyn @apilloud

TheNeuralBit avatar Sep 19 '22 23:09 TheNeuralBit