polars icon indicating copy to clipboard operation
polars copied to clipboard

Allow pickling of lazyframes

Open OneRaynyDay opened this issue 1 year ago • 4 comments

Problem Description

If I understand correctly, lazy frames are lightweight graph node objects(unless it is cached, in which case it also contains the arrow ChunkedArrays) that includes the dependencies of its query plan. In order to run lazy frame calculations across multiple slices (either horizontal/column-wise or vertical/row-wise) I'm trying to distribute the load in a compute cluster. There's a single driver responsible for generating these lazyframe slices (also pl.LazyFrames) and sends them via cloudpickle to other machines to slice.collect(). The output would be a pl.DataFrame and we can marshall that back via cloudpickle just fine.

Unfortunately, I can't do this right now because pickle (and subsequently cloudpickle) doesn't recognize builtin.PyLazyFrame as a picklable object:

import pickle

# TypeError: cannot pickle 'builtins.PyLazyFrame' object
pickle.dumps(pl.DataFrame({"a": [1]}).lazy())

How difficult would it be to add pickle support for lazyframes?

OneRaynyDay avatar Aug 24 '22 20:08 OneRaynyDay

Yep we should. Json should already work.

ritchie46 avatar Aug 25 '22 07:08 ritchie46

Ah TIL. I can use this for now :)

OneRaynyDay avatar Aug 25 '22 15:08 OneRaynyDay

Note that you cannot serialize all graph nodes IIRC. map calls are platform-dependent. The last time I checked they could not be serialized.

mainrs avatar Aug 26 '22 12:08 mainrs

@mainrs yup - I ran into this here: https://github.com/pola-rs/polars/issues/4569

OneRaynyDay avatar Aug 26 '22 14:08 OneRaynyDay