polars
JSON serialization of UDFs
Problem Description
In Python Polars, we can specify arbitrary user-defined functions via lazyframe.map, and some operations in Polars are represented with Python functions. We can also serialize LazyFrame plans using lazyframe.write_json(), but it currently doesn't support UDFs:
ValueError: Error("the enum variant LogicalPlan::Udf cannot be serialized", line: 0, column: 0)
Could we potentially pickle the specified Python function and put it in the JSON? I can't think of a better way to allow arbitrary Python functions to be described in the serialized plan.
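A minimal stdlib-only sketch of what the proposal would look like (the plan layout and field names here are made up for illustration, not actual Polars output): pickle the UDF, base64-encode the bytes so they are valid JSON, and embed them in the serialized plan.

```python
import base64
import json
import pickle


def square(x):
    # Stand-in for a user-defined function referenced by a query plan.
    return x * x


# Hypothetical serialized plan: the pickled UDF is embedded as a
# base64 string so the whole plan stays valid JSON.
plan = {
    "op": "map",
    "udf": base64.b64encode(pickle.dumps(square)).decode("ascii"),
}
serialized = json.dumps(plan)

# A Python runtime can round-trip the plan and recover a callable.
restored = pickle.loads(base64.b64decode(json.loads(serialized)["udf"]))
print(restored(4))  # → 16
```

This round-trips fine between Python processes, but as the replies below note, such a payload is meaningless to a non-Python runtime deserializing the same plan.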
Don't start with the hardest ;)
Pretty sure I talked with one of the maintainers on SO about this. The logical plan serialization doesn't fail outright; it skips fields that cannot be represented 1-to-1 across different runtimes. This means that if you serialize a Python function to JSON but deserialize the logical plan in Rust, you would hit a wall. There is no generic way to convert functions from Python to Rust. The same holds true for JavaScript.
https://github.com/pola-rs/polars/blob/7ef502b170b4be592cc846412d965599608274ee/polars/polars-lazy/src/logical_plan/mod.rs#L177-L183
Pickling a function simply stores the function name (including the module name) and checks that this name resolves to the same object (is) as the one being pickled. If they don't match, or if the name cannot be resolved (e.g. it includes a <locals> component in its name from being defined inside another function), pickle raises an exception. All that to say that it probably wouldn't be too hard to replicate by just storing a string in the JSON without involving pickle itself, but that would never work for something like a lambda.
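The by-name behavior described above is easy to demonstrate with the stdlib alone: a named, importable function pickles as a module-qualified reference, while a lambda is rejected.

```python
import math
import pickle

# A named, importable function pickles as a module-qualified name;
# unpickling resolves that name back to the very same object.
data = pickle.dumps(math.sqrt)
assert pickle.loads(data) is math.sqrt

# A lambda has no resolvable qualified name, so pickle refuses it.
failed = False
try:
    pickle.dumps(lambda x: x + 1)
except (pickle.PicklingError, AttributeError):
    failed = True
print("lambda picklable:", not failed)
```

Note that the pickled payload contains no bytecode at all, which is exactly why it could be replicated as a plain string in JSON.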
Projects like cloudpickle attempt to close that gap by serializing the bytecode, but that's a whole can of worms with its own limitations, so I wouldn't want to re-expose those as part of a public API for a library like Polars.
I suppose one (costly) way to handle almost fully generic serialization would be to reduce a node in the plan to a DataFrame when that node cannot be serialized. That would not be perfect, since it requires being able to successfully run the node (e.g. it wouldn't work if an input Parquet file has disappeared), but it would probably be enough for lots of uses.
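The reduction idea can be sketched on a toy plan tree (the node layout, "op" names, and helpers here are all hypothetical, not Polars internals): any subtree rooted at an unserializable node is executed eagerly and replaced by its materialized rows, leaving a plan that is plain JSON.

```python
import json

# Toy plan: nested dicts; the "udf" node carries a Python callable
# and therefore cannot be serialized to JSON.
plan = {
    "op": "filter",
    "input": {
        "op": "udf",
        "fn": lambda rows: [r for r in rows if r > 0],
        "input": {"op": "literal", "rows": [-2, -1, 1, 3]},
    },
    "predicate": "x < 3",
}


def run(node):
    # Execute a toy plan node and return its rows.
    if node["op"] == "literal":
        return node["rows"]
    if node["op"] == "udf":
        return node["fn"](run(node["input"]))
    if node["op"] == "filter":
        return [r for r in run(node["input"]) if r < 3]  # hard-coded predicate
    raise ValueError(node["op"])


def reduce_unserializable(node):
    # Replace any subtree rooted at an unserializable node with its
    # materialized result, so the remaining plan is JSON-safe.
    if node["op"] == "udf":
        return {"op": "literal", "rows": run(node)}
    if "input" in node:
        node = {**node, "input": reduce_unserializable(node["input"])}
    return node


reduced = reduce_unserializable(plan)
print(json.dumps(reduced))  # now serializable: the UDF became a literal
print(run(reduced))
```

The cost mentioned above is visible here: reducing the node forces execution of everything beneath it at serialization time.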
This has already been implemented: if cloudpickle is available, Polars uses it, and otherwise falls back to pickle.
Just for completeness: what I ended up doing is serializing the logical plan as JSON. I wrapped the JSON in a custom struct that also contains metadata about the runtime environment of the original code that created the logical plan. In Python, that amounts to adding the source code of the custom functions as well as the Python package names.
From the Rust side, I am able to deserialize the logical plan, set up an exact copy of the Python runtime I had, and call the custom Python method myself.
I can do the reverse too: from Rust to Python. For that, I compile the Rust functions that are part of the logical plan into WASM modules that I call from the Python side. Using Apache Arrow's IPC, this stuff is really easy to do!
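The envelope approach described in this post can be sketched in plain Python (every field name here is made up; in practice one might capture function source with inspect.getsource, which is inlined as a string below for simplicity):

```python
import json
import platform

# Stand-in for the plan JSON produced by write_json(); contents are invented.
plan_json = '{"op": "map", "udf": "my_udf"}'

# Hypothetical envelope: the plan plus enough metadata to rebuild the
# original runtime on the receiving side.
udf_source = "def my_udf(x):\n    return x * 2\n"
envelope = {
    "plan": json.loads(plan_json),
    "python_version": platform.python_version(),
    "udf_sources": {"my_udf": udf_source},
    "packages": ["polars"],  # packages the receiving side must install
}
serialized = json.dumps(envelope)

# The receiving side recreates the callable by exec-ing its source.
namespace = {}
exec(json.loads(serialized)["udf_sources"]["my_udf"], namespace)
print(namespace["my_udf"](21))  # → 42
```

Shipping source rather than pickled bytecode is what makes the plan inspectable from Rust: the host only needs to stand up a matching Python interpreter and feed it the recorded source.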