datafusion [EPIC] Unify Function Interface (remove `BuiltInScalarFunction`)

Is your feature request related to a problem or challenge?

This is based on the wonderful writeup from @2010YOUY01 in https://github.com/apache/arrow-datafusion/issues/7977

As previously discussed in https://github.com/apache/arrow-datafusion/issues/7110 https://github.com/apache/arrow-datafusion/pull/7752 there are a few challenges with how ScalarFunctions are handled, notably that there are two distinct implementations -- BuiltinScalarFunction and ScalarUDF

Problems with `BuiltinScalarFunction`

As more functions are added, the total footprint of DataFusion grows, even for those who don't need the specific functions. This also acts to limit the number of functions built into DataFusion
The desired semantics may be different for different users(e.g. many of the built in functions in DataFusion mirror postgres behavior, but some users wish to mimic spark behavior)
User defined functions are treated differently from built in functions in some ways (e.g. they can't have aliases)
Adding a new built in function requires modifications in multiple places which makes the barrier overly high.Built-in functions are implemented with Enum BuiltinScalarFunction, and function implementations like return_type() are large methods that match every enum variant.

Problems with `ScalarUDF`

The current implementation of ScalarUDFs is a struct, and does not cover all the functionalities of existing built-in functions
Defining a new ScalarUDF requires constructing a struct in an imperative way providing Arc function pointers (see examples/simple_udf.rs) for each part of the UDF, which is not familiar to Rust users where it is more common to see dyn Trait objects

Describe the solution you'd like

I propose moving DataFusion to only use ScalarUDFs and remove BuiltInScalarFunction. This will ensure:

ScalarUDFs have access to all the same functionality as "built in " functions.
No function specific code will escape the planning phase
DataFusion's core can remain focused, and external libraries of packages can be used to customize its use.

We will keep the existing ScalarUDF interface as much as possible, while also potentially providing an easier way to define them (ideally via a trait object).

Describe alternatives you've considered

https://github.com/apache/arrow-datafusion/issues/7977 describes introducing a new trait and unifying both ScalarUDF and BuiltInScalarFunction with this trait.

This approach also allows gradually migrating existing built-in functions to the new one, the old UDF interface create_udf() can keep unchanged.

However, I think it is a bigger change for users, and has the danger of making the overall complexity of DataFusion worse. As demonstrated in https://github.com/apache/arrow-datafusion/pull/8046 it is also feasible to allow new ScalarUDFs to be defined using a trait while retaining backwards compatibility for existing ScalarUDF implementations

Additional context

Proposed implementation steps:

[x] Prototype ScalarUDF interface changes (make the fields non pub): https://github.com/apache/arrow-datafusion/pull/8039
[x] Prototype how registering external packages would look like (by making a prototype for some BuildInFunctions): https://github.com/apache/arrow-datafusion/pull/8046
[x] Propose ScalarUDF API changes for real: https://github.com/apache/arrow-datafusion/pull/8079
[x] https://github.com/apache/arrow-datafusion/pull/8059
[x] List additional feature gaps between built in functions and ScalarUDfs and close them
[x] https://github.com/apache/arrow-datafusion/pull/8114
[x] https://github.com/apache/arrow-datafusion/issues/8346
[x] https://github.com/apache/arrow-datafusion/issues/8347
[x] https://github.com/apache/arrow-datafusion/issues/8756
[x] https://github.com/apache/arrow-datafusion/issues/8157
[ ] https://github.com/apache/arrow-datafusion/issues/8051
[x] https://github.com/apache/arrow-datafusion/issues/8712
[x] https://github.com/apache/arrow-datafusion/pull/8088
[x] Implement aliasing for ScalarUDF: https://github.com/apache/arrow-datafusion/issues/8348
[x] Implement trait based ScalarUDF: https://github.com/apache/arrow-datafusion/issues/8568
[x] https://github.com/apache/arrow-datafusion/issues/9100
[x] Create a new datafusion-function crate with an initial set of functions as a model (see https://github.com/apache/arrow-datafusion/pull/8046)
[x] Create tickets for extracting the remaining lists of packages into the datafusion_functions crate, file tickets to track them (https://github.com/apache/arrow-datafusion/issues/9285)
[ ] https://github.com/apache/arrow-datafusion/issues/9285
[x] https://github.com/apache/arrow-datafusion/issues/9074
[x] File follow on tickets for applying the same treatment to AggregateUDF https://github.com/apache/arrow-datafusion/issues/8708 and WindowUDF https://github.com/apache/arrow-datafusion/issues/8709

Nov 03 '23 18:11 alamb

Would like to add that supporting serialization of user-defined functions would be quite nice. The current approach to serialization is to basically just use a string and so if you have a scalar udf with some constant parameter you end up having to encode the constant as a literal expression which is less than ideal. Providing a mechanism (through PhysicalExtensionCodec I assume?) to serialize udfs would be very nice.

Nov 03 '23 19:11 thinkharderdev

Would like to add that supporting serialization of user-defined functions would be quite nice.

I don't understand this question @thinkharderdev 🤔

The current approach to serialize a ScalarUDF as I understand it, is to send over the function name along with the arguments

https://github.com/apache/arrow-datafusion/blob/c2e768052c43e4bab6705ee76befc19de383c2cb/datafusion/proto/proto/datafusion.proto#L686-L689

And then the physical plan deserialization looks that function up by name https://docs.rs/datafusion-proto/latest/src/datafusion_proto/physical_plan/from_proto.rs.html#318-333

The only difference compared to BuiltInScalarFunction is that BuiltInScalarFunction uses a number (in an enum) which might be slightly faster

Nov 03 '23 21:11 alamb

Would like to add that supporting serialization of user-defined functions would be quite nice.

I don't understand this question @thinkharderdev 🤔

The current approach to serialize a ScalarUDF as I understand it, is to send over the function name along with the arguments

https://github.com/apache/arrow-datafusion/blob/c2e768052c43e4bab6705ee76befc19de383c2cb/datafusion/proto/proto/datafusion.proto#L686-L689

And then the physical plan deserialization looks that function up by name https://docs.rs/datafusion-proto/latest/src/datafusion_proto/physical_plan/from_proto.rs.html#318-333

The only difference compared to BuiltInScalarFunction is that BuiltInScalarFunction uses a number (in an enum) which might be slightly faster

Sorry, what I mean is that it would be useful to be able to serialize constant parameters into the user-defined scalar function themselves rather than pass them in as expressions. So for instance if you had to create a UDF to do something with a regex that you have as a static constant. Currently the way to do that is pass it as a literal expression. But then you have to compile the regex again for every batch you process through the UDF. Ideally you could have something like:

struct MyRegexUdf {
  regex: Regex
}

impl ScalarFunction for MyRegexUdf {
  // use regex on each value somehow
}

The regex would only need to be compiled once during deserialization (or construction) instead of once for each batch.

Nov 04 '23 00:11 thinkharderdev

Sorry, what I mean is that it would be useful to be able to serialize constant parameters into the user-defined scalar function themselves rather than pass them in as expressions

This makes total sense -- I filed https://github.com/apache/arrow-datafusion/issues/8051 to track the issue

Nov 04 '23 10:11 alamb

Here is a PR that shows what DataFusion might look like after removing BuiltInScalarFunction and moving everything to ScalarUDF: https://github.com/apache/arrow-datafusion/pull/8046

Nov 04 '23 10:11 alamb

According to the previous prototyping in https://github.com/apache/arrow-datafusion/pull/7978, we might need to do several cleanups towards this issue.

Following 2 is necessary for separating `encode()/decode()` and make API change (in https://github.com/apache/arrow-datafusion/pull/8046)

return_type() now do both input validation and generating output, these responsibilities have to be separated
resolve expr_fn using function string names as discussed in https://github.com/apache/arrow-datafusion/pull/8046#issuecomment-1795967063

Functionalitiy gaps between `BuiltinScalarFunction` and `ScalarUDF` include

aliasing
monotonicity
Figure out the interface to let core pass information to function definitions (e.g. now() requires to be passed query start time from core) https://github.com/apache/arrow-datafusion/blob/308c35404939ed39e4e398054d6fc78c89817109/datafusion/physical-expr/src/functions.rs#L608

And other clean-up tasks might be helpful

Can they be moved into function implementations https://github.com/apache/arrow-datafusion/blob/308c35404939ed39e4e398054d6fc78c89817109/datafusion/physical-expr/src/functions.rs#L75-L204

Nov 07 '23 06:11 2010YOUY01

Thank you for the investigation @2010YOUY01

monotonicity

This gap may be one of the gaps we are observing in IOx from a recent update from @mustafasrepo and @ozankabak in https://github.com/apache/arrow-datafusion/pull/8006 / https://github.com/apache/arrow-datafusion/issues/8043🤔

I will try and make it a priority to get a PR up to encapsuate the ScalarUDF fields so we can begin closing these gaps and I will update the list above

Nov 07 '23 13:11 alamb

Figure out the interface to let core pass information to function definitions (e.g. now() requires to be passed query start time from core)

This is a similar usecase to what @thinkharderdev describes on https://github.com/apache/arrow-datafusion/issues/8051 -- I wrote up some ideas https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1798636241

Nov 07 '23 14:11 alamb

Can they be moved into function implementations

@2010YOUY01 -- I totally agree this could be cleaned up. Moving it to specific implementations would be great. Could you make a PR to do so?

Nov 07 '23 14:11 alamb

@2010YOUY01 I have updated the task list on this ticket based on your investigation. Please take a look when you have a chance.

The only one I don't understand is

return_type() now do both input validation and generating output, these responsibilities have to be separated

Can you explain why validation of input types + output type calculation need to be separated?

Nov 07 '23 15:11 alamb

@2010YOUY01 I have updated the task list on this ticket based on your investigation. Please take a look when you have a chance.

The only one I don't understand is

return_type() now do both input validation and generating output, these responsibilities have to be separated

Can you explain why validation of input types + output type calculation need to be separated?

I think it's a simple refactor, I can open a PR to better explain and solve this issue

Can they be moved into function implementations

@2010YOUY01 -- I totally agree this could be cleaned up. Moving it to specific implementations would be great. Could you make a PR to do so?

Sure

Nov 08 '23 01:11 2010YOUY01

Thanks for driving this forward @2010YOUY01 -- it is very much appreciated.

I am planning to merge https://github.com/apache/arrow-datafusion/pull/8079 in 4 more days (after it has been open for a week to ensure anyone who wants has had a chance to review it)

In the mean time, I was thinking what other work could be done while we are waiting. I see two options:

We can just start on other tasks as draft PRs that build on https://github.com/apache/arrow-datafusion/pull/8079
We could potentially start on the following:

Implement some way for expr_fn to work using function string names rather than fully resolved ScalarUDF as discussed in https://github.com/apache/arrow-datafusion/pull/8046#issuecomment-1795967063)

What do you think? Is any of that interesting to you?

Nov 10 '23 21:11 alamb

Thanks for driving this forward @2010YOUY01 -- it is very much appreciated.

I am planning to merge #8079 in 4 more days (after it has been open for a week to ensure anyone who wants has had a chance to review it)

In the mean time, I was thinking what other work could be done while we are waiting. I see two options:

We can just start on other tasks as draft PRs that build on Make fields of ScalarUDF , AggregateUDF and WindowUDF non pub #8079

We could potentially start on the following:

Implement some way for expr_fn to work using function string names rather than fully resolved ScalarUDF as discussed in #8046 (comment))

What do you think? Is any of that interesting to you?

I'm going to work on 2 in the next few days. I think this expr_fn task should better be done before migrating the first function.

Nov 11 '23 05:11 2010YOUY01

I'm going to work on 2 in the next few days. I think this expr_fn task should better be done before migrating the first function.

Thanks! one thought I had was to introduce a new Expr:: variant, and port one or two of the expr_fn's to show how it works and the the first PR through. Then as a second PR we can port over the remaining expr_fns (which will be a larger PR but very mechanical)

Nov 11 '23 11:11 alamb

Added some more items to this list:

[ ] https://github.com/apache/arrow-datafusion/issues/8346
[ ] https://github.com/apache/arrow-datafusion/issues/8347
[ ] Implement aliasing for ScalarUDF: https://github.com/apache/arrow-datafusion/issues/8348

Nov 28 '23 21:11 alamb

datafusion datafusion copied to clipboard

[EPIC] Unify Function Interface (remove `BuiltInScalarFunction`)

Is your feature request related to a problem or challenge?

Problems with BuiltinScalarFunction

Problems with ScalarUDF

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Following 2 is necessary for separating encode()/decode() and make API change (in https://github.com/apache/arrow-datafusion/pull/8046)

Functionalitiy gaps between BuiltinScalarFunction and ScalarUDF include

And other clean-up tasks might be helpful

datafusion
datafusion copied to clipboard

Problems with `BuiltinScalarFunction`

Problems with `ScalarUDF`

Following 2 is necessary for separating `encode()/decode()` and make API change (in https://github.com/apache/arrow-datafusion/pull/8046)

Functionalitiy gaps between `BuiltinScalarFunction` and `ScalarUDF` include