ibis
ibis copied to clipboard
feat: add support for data source UDFs
Currently, Ibis supports UDFs only for vectorized functions - called vectorized UDFs. This issue is for adding support for UDFs for data sources, starting with a discussion about it. A data source UDF node is similar to a table node that is implemented by a UDF embedded within it. The user-code implementing a data source might look something like:
@datasource(schema=...)
def read_special_data_source(...):
...
return record_batch_stream # or other types of sources
and the code for a node embedding it might look something like ibis.datasource(udf=read_special_data_source).
This is a feature I intend to implement once the design discussion reaches agreement.
cc @cpcloud, @icexelloss
I suspect that we want to have a generic table UDF mechanism and not have something specific for data sources. Any function that produces rows should be eligible for use as a table function.
DuckDB has some useful prior art here with read_parquet and read_csv_auto, it might be worth looking into those for inspiration of the data model for these functions.
I suspect that we want to have a generic table UDF mechanism and not have something specific for data sources.
Do you mean to extend the ibis.table API to support a UDF parameter? Or do you have some other API in mind?
Any function that produces rows should be eligible for use as a table function.
Agreed, this is what I meant in my example code. There are also multiple return-types of a UDF that could be considered valid for this purpose.
DuckDB has some useful prior art here with
read_parquetandread_csv_auto, it might be worth looking into those for inspiration of the data model for these functions.
I found pages on parquet and CSV loading. However, these are pages that would be useful for someone implementing a UDF (as examples for reading parquet or CSV data), whereas my goal here is to design the Ibis API for integrating such UDFs into an Ibis expression. I'm not too experienced in designing Ibis APIs, so I preferred to have this discussion first.
@rtpsw Thanks for this - I remember the mailing discussion concluded perhaps a design doc for more detailed discussion, I think what @cpcloud commented here could be good things to look into - Do you mind put together a simple design doc to circulate on the Arrow mailing list and include cpcloud and Weston to review?
I think most Arrow folks do not watch ibis issues so maybe mailing list is still the better place for discussion of this
@icexelloss, my main goal in this issue is to figure out a design that fits Ibis. I think it would be easier to present an end-to-end (Ibis/Ibis-Substrait-Arrow) design following this. Nevertheless, I'm currently experimenting with potential designs and may be able to use the experience so gained to cook up a design proposal doc like you're asking for.
After some experimentation, I'd propose the Ibis/Ibis-Substrait design shown below. @cpcloud, could you provide some feedback?
In Ibis:
- Add a new
UserDefinedSourceTableoperation derived fromPhysicalTablethat has similar validating members asVectorizedUDF(refactor to reuse code) as well as a schema consistent with its output type. - Add
class UserDefinedTablewhich is similar toUserDefinedFunction(refactor to reuse code) except that it enforces zero-inputs and coercion to a dataframe. - Add a
_udt_decoratordecorator-factory similar to_udf_decoratorexcept that it takes no input types and is based onUserDefinedTable. - Add a
sourcedecorator-factory which forwardsoutput_typeand passesUserDefinedSourceTableasnode_typeto_udt_decorator, The user will apply this decorator-factory to a function that sources data.
In Ibis-Substrait:
- In the protobuf definitions, add to
ReadRela newread_typewith nameudtand typeUserDefinedTablehavingcode,summary, anddescriptionfields. - Register a
UserDefinedSourceTable-type dispatch fortranslatethat populates thebase_schemaandudtfields ofReadRel. In particular, use thecodefield underudtto serialize (using cloudpickle and base64) the function carried by theUserDefinedSourceTableoperation.
While the focus of the current issue is on the Ibis design, I added above the Ibis-Substrait design to support the design evaluation. I'm leaving out the Arrow design - I already have working (Py)Arrow code that can deal with serialized Python functions.
This makes sense to me.
Closing as a duplicate of #4582.