ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat: Support Serialization of Ibis DataFrames for Distributed Execution with Ray

Open Sara25776 opened this issue 6 months ago • 2 comments

Is your feature request related to a problem?

Problem

Attempting to send an Ibis DataFrame to Ray results in a pickling error:

AttributeError: 'Backend' object has no attribute '_record_batch_readers_consumed'

This appears to stem from the fact that Ibis connections (and by extension, DataFrames) are not currently serializable.

What is the motivation behind your request?

🧩 Use Case We are building a data analytics framework using Ibis DataFrames in combination with Ray for distributed execution. We plan to implement a function that accepts an Ibis DataFrame as a parameter and dispatches it to a Ray cluster for execution. However, since the Ibis DataFrame may use any backend (which is unknown to the function), we encounter serialization issues when passing it to Ray.

Describe the solution you'd like

Feature Request We propose the following enhancements:

1. Long-Term Solution Enable Ibis DataFrames to be serialized and deserialized safely, allowing them to be passed across Ray workers or other distributed systems.

2. Interim Solution To work around the serialization issue, we plan to extract and transmit connection parameters manually. This requires access to internal attributes:

backend = ibis_df.get_backend()
backend_name = backend.name
con_args = backend._con_args
con_kwargs = backend._con_kwargs

On the Ray worker side, we re-establish the connection like so:

if backend_name == "duckdb":
    con = ibis.duckdb.connect(*con_args, **con_kwargs)

For this workaround to be viable, we request that _con_args and _con_kwargs be exposed

We believe these changes would significantly improve Ibis's usability in distributed computing environments. Please let us know if there are recommended alternatives.

What version of ibis are you running?

10.4.0

What backend(s) are you using, if any?

DuckDB

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

Sara25776 avatar May 26 '25 11:05 Sara25776

Thanks for getting in touch!

My hunch is that such a system should handle connections separate from Ibis expressions, and avoid so-called bound expressions entirely.

This means that whenever you're dealing with Ibis expressions they're unbound, that is, they have no connection behind them (in-memory tables are included in this description). Unbound tables are usually constructed with the ibis.table function which accepts a schema-like object and optionally a table name.

Then you're free to use whatever connection-argument-serialization handling approach you want.

Serializing connections isn't really a solvable problem as far as I can tell, due to the fact that lots of important connection state is tied to a given process. In the case of DuckDB, for example, Ibis make heavy use of TEMPORARY tables and views, whose lifetime is by definition tied to the process that created them.

cpcloud avatar May 26 '25 11:05 cpcloud

Thanks for the response. It would be great, if you could expose the internal variables _con_args and _con_kwargs

Sara25776 avatar May 27 '25 14:05 Sara25776