datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Add a ScalarUDFImpl::simplfy() API

Open alamb opened this issue 1 year ago • 8 comments

Is your feature request related to a problem or challenge?

As part of porting arrow_cast https://github.com/apache/arrow-datafusion/issues/9287 and make_array https://github.com/apache/arrow-datafusion/issues/9288, it has become clear that some functions have special simplification semantics:

  1. arrow_cast simplifies to cast of a particular data type. It is important for the rest of datafusion to understand the Cast semantics as the there are several special cases for Expr::Cast (such as unwrap_cast_in_comparison)
  2. make_array has special rewrite rules to combine / fold with array_append and array_prepaend in the simplifier (source link)

Also I think some systems may want to provide the ability to define UDFs that are more like macros (can be expressed in terms of other built in expressions), which needs some way for datafusion to "inline" the definition

Similarly, specialized functions (e.g replace regexp_match with a version that had the regexp pre-compiled ...) - https://github.com/apache/arrow-datafusion/issues/8051 sound very similar

Describe the solution you'd like

I propose adding a function such as the following, we could implement the simplifications for make_array and arrow_cast in the UDF (and not in the main simplify code):

/// Was the expression simplified?
enum Simplified {
  /// The function call was simplified to an entirely new Expr
  Rewritten(Expr),
  /// the function call could not be simplified, and the arguments
  /// are return unmodified
  Original(Vec<Expr>)
}
pub trait ScalarUDFImpl {
...

  /// Apply any function specific simplification rules
  ///
  /// Some functions like arrow_cast have special semantic simplifications
  /// (into `Expr::Cast` for example) that can improve planning.
  ///
  /// If there is a simpler representation of calling this function with the
  /// specified arguments, return `Simplified::Rewritten` with the simplification.
  /// If no such simplification was possible, returns `Simplified::Original` with
  /// the unmodified arguments (the default)
  ///
  /// This function should only apply simplifications specific to this function.
  /// DataFusion will automatically simplify the arguments with a variety
  /// of rewrites during optimization
  fn simplify(&self, args: Vec<Expr>) -> Result<Simplified> {
    Ok(Simplified::Original(args)
  }
  ...

}

Describe alternatives you've considered

There may be better

Additional context

No response

alamb avatar Feb 20 '24 08:02 alamb

FYI @jayzhan211 -- I wonder if you have any thoughts about this ticket (or maybe have time to try implementing it).

I think a good test case would be to implement a udf function such as cast_to_int64 that was rewritten into cast(x as int64) or something

alamb avatar Feb 20 '24 08:02 alamb

cc @universalmind303 and @thinkharderdev as I think you have been thinking about similar things

alamb avatar Feb 20 '24 08:02 alamb

In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.

Lordworms avatar Feb 21 '24 02:02 Lordworms

In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.

Yes, @Lordworms I think that is an excellent idea. cc @mustafasrepo for your thoughts

Note that https://github.com/apache/arrow-datafusion/issues/9213 is for an aggregate function, which I think would have a similar API though it would be on a different struct (AggreagateUDF)

alamb avatar Feb 21 '24 07:02 alamb

In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.

Yes, @Lordworms I think that is an excellent idea. cc @mustafasrepo for your thoughts

Note that #9213 is for an aggregate function, which I think would have a similar API though it would be on a different struct (AggreagateUDF)

I think, function level optimizer would be much more scalable. Also, simplification logic will be more contained. Also, once we can bring similar changes in ScalarUDFImpl::simplfy to AggregateUDF for support of this feature.

mustafasrepo avatar Feb 21 '24 07:02 mustafasrepo

@jayzhan211 -- I think this proposal is also likely to be needed to get make_array moved into datafusion-functions-array - https://github.com/apache/arrow-datafusion/issues/9288

alamb avatar Feb 21 '24 07:02 alamb

@alamb How is simplify like for array concat operator rewrite? If we have MakeArrayUDF, args are for MakeArray only, but we need BinaryExpr that includes two array functions and 1 operator to start the simplification. It seems function level optimizer can only works for simplification like A function to B Expr / function.

jayzhan211 avatar Feb 21 '24 09:02 jayzhan211

simplify for arrow_cast is just trying to do actual casting x as i64 right?

jayzhan211 avatar Feb 21 '24 09:02 jayzhan211

@alamb How is simplify like for array concat operator rewrite? If we have MakeArrayUDF, args are for MakeArray only, but we need BinaryExpr that includes two array functions and 1 operator to start the simplification. It seems function level optimizer can only works for simplification like A function to B Expr / function.

🤔 that is a good point that the simplification is needed for StringConcat operator 🤔

simplify for arrow_cast is just trying to do actual casting x as i64 right?

Yes

Sorry for my delay, I have been out the last few days. I am catching up with reviews

alamb avatar Feb 26 '24 17:02 alamb