ibis
ibis copied to clipboard
feat: Support Serialization of Ibis DataFrames for Distributed Execution with Ray
Is your feature request related to a problem?
❗ Problem
Attempting to send an Ibis DataFrame to Ray results in a pickling error:
AttributeError: 'Backend' object has no attribute '_record_batch_readers_consumed'
This appears to stem from the fact that Ibis connections (and by extension, DataFrames) are not currently serializable.
What is the motivation behind your request?
🧩 Use Case We are building a data analytics framework using Ibis DataFrames in combination with Ray for distributed execution. We plan to implement a function that accepts an Ibis DataFrame as a parameter and dispatches it to a Ray cluster for execution. However, since the Ibis DataFrame may use any backend (which is unknown to the function), we encounter serialization issues when passing it to Ray.
Describe the solution you'd like
✅ Feature Request We propose the following enhancements:
1. Long-Term Solution Enable Ibis DataFrames to be serialized and deserialized safely, allowing them to be passed across Ray workers or other distributed systems.
2. Interim Solution To work around the serialization issue, we plan to extract and transmit connection parameters manually. This requires access to internal attributes:
backend = ibis_df.get_backend()
backend_name = backend.name
con_args = backend._con_args
con_kwargs = backend._con_kwargs
On the Ray worker side, we re-establish the connection like so:
if backend_name == "duckdb":
con = ibis.duckdb.connect(*con_args, **con_kwargs)
For this workaround to be viable, we request that _con_args and _con_kwargs be exposed
We believe these changes would significantly improve Ibis's usability in distributed computing environments. Please let us know if there are recommended alternatives.
What version of ibis are you running?
10.4.0
What backend(s) are you using, if any?
DuckDB
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Thanks for getting in touch!
My hunch is that such a system should handle connections separate from Ibis expressions, and avoid so-called bound expressions entirely.
This means that whenever you're dealing with Ibis expressions they're
unbound, that is, they have no connection behind them (in-memory tables are
included in this description). Unbound tables are usually constructed with the
ibis.table function which accepts a schema-like object and optionally a table
name.
Then you're free to use whatever connection-argument-serialization handling approach you want.
Serializing connections isn't really a solvable problem as far as I can tell, due to
the fact that lots of important connection state is tied to a given process. In the case
of DuckDB, for example, Ibis make heavy use of TEMPORARY tables and views, whose lifetime
is by definition tied to the process that created them.
Thanks for the response. It would be great, if you could expose the internal variables _con_args and _con_kwargs