
feat: add support for data source UDFs

rtpsw opened this issue on Jun 20 '22 · 8 comments

Currently, Ibis supports UDFs only for vectorized functions (vectorized UDFs). This issue proposes adding support for data source UDFs, starting with a design discussion. A data source UDF node is similar to a table node, except that it is backed by a UDF embedded within it. The user code implementing a data source might look something like:

@datasource(schema=...)
def read_special_data_source(...):
   ...
   return record_batch_stream # or other types of sources

and the code for a node embedding it might look something like ibis.datasource(udf=read_special_data_source).
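For concreteness, here is a hypothetical end-to-end sketch of the proposed usage. Neither the datasource decorator nor ibis.datasource exists in Ibis today; the pyarrow calls are real, and the schema and column names are illustrative only.

import ibis
import pyarrow as pa

@datasource(schema=ibis.schema({"ts": "int64", "value": "float64"}))
def read_special_data_source():
    # An in-memory stub standing in for a custom source of record batches.
    batch = pa.record_batch(
        [pa.array([1, 2], type=pa.int64()), pa.array([0.5, 1.5])],
        names=["ts", "value"],
    )
    return pa.RecordBatchReader.from_batches(batch.schema, [batch])

# The node embedding the UDF would behave like any other table expression:
t = ibis.datasource(udf=read_special_data_source)
expr = t.filter(t.value > 1.0)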

rtpsw avatar Jun 20 '22 16:06 rtpsw

This is a feature I intend to implement once the design discussion reaches agreement.

cc @cpcloud, @icexelloss

rtpsw avatar Jun 20 '22 16:06 rtpsw

I suspect that we want to have a generic table UDF mechanism and not have something specific for data sources. Any function that produces rows should be eligible for use as a table function.

DuckDB has some useful prior art here with read_parquet and read_csv_auto; it might be worth looking into those for inspiration on the data model for these functions.
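For reference, here is what that DuckDB prior art looks like from Python: read_parquet and read_csv_auto are existing DuckDB table functions that appear in the FROM clause and produce rows, which is the shape a generic Ibis table UDF could target. The file paths are illustrative.

import duckdb

con = duckdb.connect()
parquet_rows = con.execute(
    "SELECT * FROM read_parquet('data.parquet')"
).fetch_arrow_table()
csv_rows = con.execute(
    "SELECT * FROM read_csv_auto('data.csv')"
).fetch_arrow_table()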

cpcloud avatar Jun 20 '22 17:06 cpcloud

I suspect that we want to have a generic table UDF mechanism and not have something specific for data sources.

Do you mean to extend the ibis.table API to support a UDF parameter? Or do you have some other API in mind?

Any function that produces rows should be eligible for use as a table function.

Agreed, this is what I meant in my example code. There are also multiple return types of a UDF that could be considered valid for this purpose; a few possibilities are sketched below.
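A sketch of return types a data source UDF might reasonably produce, assuming the node coerces each into a table. This list is illustrative, not a decided API.

import pandas as pd
import pyarrow as pa

def as_pandas():
    return pd.DataFrame({"value": [1.0, 2.0]})

def as_arrow_table():
    return pa.table({"value": [1.0, 2.0]})

def as_batch_stream():
    batch = pa.record_batch([pa.array([1.0, 2.0])], names=["value"])
    return pa.RecordBatchReader.from_batches(batch.schema, [batch])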

DuckDB has some useful prior art here with read_parquet and read_csv_auto, it might be worth looking into those for inspiration of the data model for these functions.

I found documentation pages on Parquet and CSV loading. However, those pages are mainly useful to someone implementing a UDF (as examples of reading Parquet or CSV data), whereas my goal here is to design the Ibis API for integrating such UDFs into an Ibis expression. I'm not very experienced in designing Ibis APIs, so I preferred to have this discussion first.

rtpsw avatar Jun 20 '22 18:06 rtpsw

@rtpsw Thanks for this. I recall the mailing list discussion concluded that a design doc would help the more detailed discussion, and I think what @cpcloud commented here is worth looking into. Would you mind putting together a simple design doc to circulate on the Arrow mailing list, and including cpcloud and Weston as reviewers?

icexelloss avatar Jun 21 '22 18:06 icexelloss

I think most Arrow folks do not watch Ibis issues, so the mailing list is probably still the better place to discuss this.

icexelloss avatar Jun 21 '22 18:06 icexelloss

@icexelloss, my main goal in this issue is to figure out a design that fits Ibis. I think it would be easier to present an end-to-end (Ibis/Ibis-Substrait/Arrow) design after that. Nevertheless, I'm currently experimenting with potential designs and may be able to use the experience gained to put together a design proposal doc like the one you're asking for.

rtpsw avatar Jun 21 '22 19:06 rtpsw

After some experimentation, I'd propose the Ibis/Ibis-Substrait design shown below. @cpcloud, could you provide some feedback?

In Ibis:

  • Add a new UserDefinedSourceTable operation derived from PhysicalTable that has validating members similar to those of VectorizedUDF (refactor to reuse code) as well as a schema consistent with its output type.
  • Add a UserDefinedTable class, similar to UserDefinedFunction (refactor to reuse code), except that it enforces zero inputs and coercion to a dataframe.
  • Add a _udt_decorator decorator-factory similar to _udf_decorator except that it takes no input types and is based on UserDefinedTable.
  • Add a source decorator-factory which forwards output_type and passes UserDefinedSourceTable as the node_type to _udt_decorator. The user will apply this decorator-factory to a function that sources data (see the sketch after this list).
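A minimal sketch of the proposed source decorator-factory, inlining what _udt_decorator would do. None of this is existing Ibis API; UserDefinedSourceTable is the proposed operation above, and .to_expr() mirrors the existing Ibis convention for turning an operation into an expression.

import functools

def source(output_type):
    """Wrap a zero-argument data-sourcing function as a table expression."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper():
            # The node carries the user function and the declared schema.
            node = UserDefinedSourceTable(func=func, schema=output_type)
            return node.to_expr()
        return wrapper
    return decorator

# Intended usage (hypothetical):
# @source(output_type=ibis.schema({"ts": "int64", "value": "float64"}))
# def read_special_data_source(): ...
# table_expr = read_special_data_source()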

In Ibis-Substrait:

  • In the protobuf definitions, add to ReadRel a new read_type named udt of type UserDefinedTable, with code, summary, and description fields.
  • Register a UserDefinedSourceTable-type dispatch for translate that populates the base_schema and udt fields of ReadRel. In particular, use the code field under udt to carry the function held by the UserDefinedSourceTable operation, serialized using cloudpickle and base64 (see the serialization sketch after this list).
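A sketch of the proposed serialization for the udt.code field, assuming cloudpickle plus base64 as described above. The field names come from the proposed protobuf extension, not from existing ibis-substrait definitions.

import base64
import cloudpickle

def serialize_udf(func) -> str:
    """Serialize a Python function to a base64 string for embedding in ReadRel."""
    return base64.b64encode(cloudpickle.dumps(func)).decode("ascii")

def deserialize_udf(code: str):
    """Recover the function on the consumer (e.g. Arrow) side."""
    return cloudpickle.loads(base64.b64decode(code.encode("ascii")))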

While the focus of the current issue is on the Ibis design, I included the Ibis-Substrait design above to support the design evaluation. I'm leaving out the Arrow design; I already have working (Py)Arrow code that can handle serialized Python functions.

rtpsw avatar Jun 22 '22 12:06 rtpsw

This makes sense to me.

icexelloss avatar Jun 22 '22 13:06 icexelloss

Closing as a duplicate of #4582.

cpcloud avatar Mar 16 '23 13:03 cpcloud