beam
beam copied to clipboard
Add support for out-of-band pickling (pickle 5) in PickleCoder
dev@ discussion: https://lists.apache.org/thread.html/r266f37640901544927205b913d4903340d6c59c3d94905e5dda0db42%40%3Cdev.beam.apache.org%3E
PickleCoder should support out-of-band pickling to avoid unnecessary memory copies for types that support it (including numpy and pandas types).
Imported from Jira BEAM-12418. Original Jira may contain additional context. Reported by: bhulette.
I looked into this a little bit. I don't think we can see huge performance wins here because we don't have an out-of-band path for transferring the data (e.g. shared memory). No matter what we're going to need to write/read the buffers over the Fn API in-band with the rest of the encoded object.
However as noted in PEP 574 one can still see an improvement in in-band performance with the pickle 5 protocol. It can eliminate a memcopy on the serialization path because we no longer have to materialize a full byte[] representing the serialized object (copying all the buffers) and then copy that byte[] to the output buffer.
I ran some benchmarks to confirm this, we can see that with pickle5, pickle.dump performs better when writing to a file-like object:
To take advantage of this I think we'd need to expose the OutputStream as a file-like object, and pass that to pickle.dump. This would allow pickle to write buffers directly to the output stream. Note the current behavior is to execute pickle.dumps
(one memcopy) and write the result to the output stream (second memcopy).
CCing a few folks: @robertwb @shoyer @tvalentyn @apilloud