datafusion
datafusion copied to clipboard
Add a ScalarUDFImpl::simplfy() API
Is your feature request related to a problem or challenge?
As part of porting arrow_cast
https://github.com/apache/arrow-datafusion/issues/9287 and make_array
https://github.com/apache/arrow-datafusion/issues/9288, it has become clear that some functions have special simplification semantics:
-
arrow_cast
simplifies tocast
of a particular data type. It is important for the rest of datafusion to understand theCast
semantics as the there are several special cases for Expr::Cast (such asunwrap_cast_in_comparison
) -
make_array
has special rewrite rules to combine / fold witharray_append
andarray_prepaend
in the simplifier (source link)
Also I think some systems may want to provide the ability to define UDFs that are more like macros (can be expressed in terms of other built in expressions), which needs some way for datafusion to "inline" the definition
Similarly, specialized functions (e.g replace regexp_match
with a version that had the regexp pre-compiled ...) - https://github.com/apache/arrow-datafusion/issues/8051 sound very similar
Describe the solution you'd like
I propose adding a function such as the following, we could implement the simplifications for make_array
and arrow_cast
in the UDF (and not in the main simplify code):
/// Was the expression simplified?
enum Simplified {
/// The function call was simplified to an entirely new Expr
Rewritten(Expr),
/// the function call could not be simplified, and the arguments
/// are return unmodified
Original(Vec<Expr>)
}
pub trait ScalarUDFImpl {
...
/// Apply any function specific simplification rules
///
/// Some functions like arrow_cast have special semantic simplifications
/// (into `Expr::Cast` for example) that can improve planning.
///
/// If there is a simpler representation of calling this function with the
/// specified arguments, return `Simplified::Rewritten` with the simplification.
/// If no such simplification was possible, returns `Simplified::Original` with
/// the unmodified arguments (the default)
///
/// This function should only apply simplifications specific to this function.
/// DataFusion will automatically simplify the arguments with a variety
/// of rewrites during optimization
fn simplify(&self, args: Vec<Expr>) -> Result<Simplified> {
Ok(Simplified::Original(args)
}
...
}
Describe alternatives you've considered
There may be better
Additional context
No response
FYI @jayzhan211 -- I wonder if you have any thoughts about this ticket (or maybe have time to try implementing it).
I think a good test case would be to implement a udf function such as cast_to_int64
that was rewritten into cast(x as int64)
or something
cc @universalmind303 and @thinkharderdev as I think you have been thinking about similar things
In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.
In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.
Yes, @Lordworms I think that is an excellent idea. cc @mustafasrepo for your thoughts
Note that https://github.com/apache/arrow-datafusion/issues/9213 is for an aggregate function, which I think would have a similar API though it would be on a different struct (AggreagateUDF
)
In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.
Yes, @Lordworms I think that is an excellent idea. cc @mustafasrepo for your thoughts
Note that #9213 is for an aggregate function, which I think would have a similar API though it would be on a different struct (
AggreagateUDF
)
I think, function level optimizer would be much more scalable. Also, simplification logic will be more contained. Also, once we can bring similar changes in ScalarUDFImpl::simplfy
to AggregateUDF
for support of this feature.
@jayzhan211 -- I think this proposal is also likely to be needed to get make_array
moved into datafusion-functions-array - https://github.com/apache/arrow-datafusion/issues/9288
@alamb How is simplify
like for array concat operator rewrite? If we have MakeArrayUDF, args
are for MakeArray only, but we need BinaryExpr that includes two array functions and 1 operator to start the simplification. It seems function level optimizer can only works for simplification like A function to B Expr / function.
simplify for arrow_cast
is just trying to do actual casting x as i64
right?
@alamb How is
simplify
like for array concat operator rewrite? If we have MakeArrayUDF,args
are for MakeArray only, but we need BinaryExpr that includes two array functions and 1 operator to start the simplification. It seems function level optimizer can only works for simplification like A function to B Expr / function.
🤔 that is a good point that the simplification is needed for StringConcat
operator 🤔
simplify for arrow_cast is just trying to do actual casting x as i64 right?
Yes
Sorry for my delay, I have been out the last few days. I am catching up with reviews