polars
Allow pickling of lazyframes
Problem Description
If I understand correctly, lazy frames are lightweight graph node objects (unless cached, in which case they also contain the Arrow ChunkedArrays) that include the dependencies of their query plan. To run lazy frame calculations across multiple slices (either horizontal/column-wise or vertical/row-wise), I'm trying to distribute the load across a compute cluster. A single driver is responsible for generating these lazy frame slices (also pl.LazyFrame objects) and sends them via cloudpickle to other machines, which call slice.collect(). The output would be a pl.DataFrame, which we can marshal back via cloudpickle just fine.
Unfortunately, I can't do this right now because pickle (and consequently cloudpickle) doesn't recognize builtins.PyLazyFrame as a picklable object:
import pickle
import polars as pl
# Raises TypeError: cannot pickle 'builtins.PyLazyFrame' object
pickle.dumps(pl.DataFrame({"a": [1]}).lazy())
How difficult would it be to add pickle support for lazyframes?
Yep, we should. JSON serialization should already work.
Ah TIL. I can use this for now :)
Note that you cannot serialize all graph nodes, IIRC. map calls are platform-dependent; the last time I checked, they could not be serialized.
@mainrs yup - I ran into this here: https://github.com/pola-rs/polars/issues/4569