ibis-substrait
Ibis has no way to convert UDFs to substrait plan
Ibis is doing some incredible work by integrating Substrait: it can generate a Substrait plan from the user's query, which supports cross-database operations in Python.
Suppose we have a table like this:
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ cust_id ┃ income1 ┃ income2 ┃ income3 ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ int64 │ float64 │ float64 │ float64 │
├─────────┼─────────┼─────────┼──────────┤
│ 1 │ 20000.0 │ 3560.57 │ nan │
│ 2 │ 34546.9 │ 6000.66 │ 1000.0 │
│ 3 │ 75430.2 │ 8111.01 │ nan │
│ 4 │ 55430.2 │ 8111.01 │ 1200.0 │
│ 5 │ nan │ 8111.01 │ nan │
│ 6 │ nan │ nan │ 100000.0 │
└─────────┴─────────┴─────────┴──────────┘
Right now we define UDFs in Ibis like this:
import ibis.expr.datatypes as dt
from ibis.backends.pandas.udf import udf
@udf.analytic(input_type=[dt.double, dt.double], output_type=dt.double)
def function(c1, c2):
    return c1 + c2
We can then apply this function to our table like this:
function(table.income1, table.income2)
Applying this function returns:
┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ AnalyticVectorizedUDF() ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64 │
├─────────────────────────┤
│ 23560.57 │
│ 40547.56 │
│ 83541.21 │
│ 63541.21 │
│ nan │
│ nan │
└─────────────────────────┘
We can even mutate our existing table to add a new column with this function:
mutate_expression = table.mutate(
    added=function(table.income1, table.income2)
)
Before coming to the main problem, consider a simple expression like this:
expression = table.income1 + table.income2
I can generate the Substrait plan for this expression using this code:
from ibis_substrait.compiler.core import SubstraitCompiler
compiler = SubstraitCompiler()
expression = table.income1 + table.income2
substrait_plan = compiler.compile(table.mutate(expression))
This gives me the Substrait plan. But when I try to generate the Substrait plan for an expression that uses a user-defined function, it fails:
udf_expression = table.mutate(
    added=function(table.income1, table.income2)
)
# udf_expression is already a mutated table, so compile it directly
substrait_plan_udf = compiler.compile(udf_expression)
Doing this gives me the error : KeyError: 'AnalyticVectorizedUDF'.
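One plausible reading of this error, stated as an assumption rather than a description of the ibis-substrait internals: a compiler that translates operations through a name-keyed lookup table raises a KeyError for any operation it has no entry for. The toy sketch below uses a hypothetical `OP_MAPPING` dict and `translate` helper (neither is ibis-substrait API) to show that failure mode.

```python
# Hypothetical stand-in for a compiler's operation lookup table.
# These names are illustrative only, not taken from ibis-substrait.
OP_MAPPING = {
    "Add": "add",
    "Subtract": "subtract",
}


def translate(op_name: str) -> str:
    """Return the target function name for a known operation."""
    # A plain dict lookup: unknown operations raise KeyError.
    return OP_MAPPING[op_name]


# A built-in operation is found in the table.
known = translate("Add")

# A UDF operation has no entry, so the lookup fails.
try:
    translate("AnalyticVectorizedUDF")
except KeyError as exc:
    message = f"KeyError: {exc}"

print(known)
print(message)
```

Under that assumption, supporting UDFs would mean giving the compiler some way to map a UDF operation to a named function that the consumer already knows.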
I even thought that Substrait itself might not support this yet. But it seems that Substrait does support:
- UserDefined type
- ParameterizedUserDefined type
- UserDefined relation
But not user-defined relations.
So it appears that Ibis does not support generating Substrait plans for user-defined functions. It would be awesome to have that.
Hi @Anindyadeep -- thanks for raising this!
We're currently refactoring UDF support in Ibis to try to make it more consistent across backends. Once that lands, we can look into how we can better support UDF compilation to Substrait.
There's currently support for element-wise UDFs, tested against pyarrow.compute -- note that this does require registering the UDF separately with the backend. There's not currently a way (that I'm aware of) to serialize a UDF into a substrait plan. Instead, the current pattern is to register a function with the backend, then refer to it by name in the substrait plan.
@gforsyth, can you please give me more details, or a link to the documentation, on registering a function with the backend and then referring to it by name in the Substrait plan? Also, does that have to be done manually with Substrait, or can it be done using Ibis?
Thanks
Documentation is very light (sorry). There's an integration test that shows how to create an elementwise UDF in Ibis, register that UDF with pyarrow.compute, and then execute it by using Substrait to send the plan from Ibis to PyArrow.
https://github.com/ibis-project/ibis-substrait/blob/main/ibis_substrait/tests/integration/test_pyarrow.py
It doesn't require using Substrait directly, but it does require defining the UDF in the pyarrow process manually.
I could not have described my questions better than you did, @Anindyadeep! I'll add that this feature could be very useful. In the meantime, your answers, @gforsyth, are very helpful, thank you!
Thanks @OmriLevyTau. Yes, this feature would be super useful.