datafusion Add simplify method to aggregate function

Which issue does this PR close?

Closes #9526.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

yes, added new test and example

Are there any user-facing changes?

no, changes are backward compatible

May 02 '24 17:05 milenkovicm

the change is quite similar to #9906 there are quite a lot optional/unused parameters in the method call, but i personally find it easier to understand as it is equivalent to AggregateFunction signature.

I wonder if ExprSimplifyResult::<T>::Original(T) where T can be something else than args (tuples included) would be a backward compatible change. This way we could send and return all required parameters without copy ... but this is discussion for different time

May 02 '24 17:05 milenkovicm

jsut thinking aloud, maybe if we change:

pub enum ExprSimplifyResult {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(Vec<Expr>),
}

to:

pub enum ExprSimplifyResult<T> {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(T),
}

simplify method would change from:

   pub fn simplify(
        &self,
        args: Vec<Expr>,
        distinct: &bool,
        filter: &Option<Box<Expr>>,
        order_by: &Option<Vec<Expr>>,
        null_treatment: &Option<NullTreatment>,
        info: &dyn SimplifyInfo,
    ) -> Result<ExprSimplifyResult> {

to:

   pub fn simplify(
        &self,
        args: Vec<Expr>,
        distinct: bool,
        filter: Option<Box<Expr>>,
        order_by: Option<Vec<Expr>>,
        null_treatment: Option<NullTreatment>,
        info: &dyn SimplifyInfo,
    ) -> Result<ExprSimplifyResult<(Vec<Expr>, bool, Option<Box<Expr>> ...)>> {
}

so in both cases, when we create replacement function or we re-assemble original function we would not need to clone parameters.

maybe hand full of changes, from -> Result<ExprSimplifyResult> to -> Result<ExprSimplifyResult<Vec<Expr>>>

Downside is that we would have to re-create expression if no simplification, plus change would be backward incompatible

May 03 '24 12:05 milenkovicm

pub enum ExprSimplifyResult<T> {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(T),
}

I think it is a good idea.

May 04 '24 03:05 jayzhan211

jsut thinking aloud, maybe if we change:

I think if we are going to change the signature, I think we should use Transformed directly (and it is already parameterized)

May 06 '24 19:05 alamb

Another option could be to make a simplify return optional closure

 fn simplify(&self ) -> Option<Box<dyn Fn(AggregateFunction) -> datafusion_common::tree_node::Transformed<Expr>>> {
        None
 }

in this case expr_simplifier part would become:

Expr::AggregateFunction(AggregateFunction {func_def: AggregateFunctionDefinition::UDF(ref udaf), ..}) => {
  match (udaf.simplify(), expr) {
    (Some(simplify), Expr::AggregateFunction(a)) => simplify(a),
    (_, e) => Transformed::no(e),
  }
}

this solution:

avoids cloning,
avoids decomposition/composition of function if not needed
avoids adding new struct
but, it has dynamic dispatch simplify function call (in case simplify is defined)
and a bit odd interface

other issues:

SimplifyInfo is missing from signature, can be added
Transformed::no does not make sense in this context as original expression has not been propagated (should we use actual expression instead of AggregateFunction or remove Transformed from return type
closure should return result with or without Transformed

May 07 '24 14:05 milenkovicm

Another option is to introduce has_simplify so we can simplify the code to

            Expr::AggregateFunction(AggregateFunction {
                func_def: AggregateFunctionDefinition::UDF(udaf),
                args: AggregateFunctionArgs,
            }) if udaf.has_simplify() => {
                Transformed::yes(udaf.simplify()?)
            }

            fn simplify() -> Result<Expr>

The downside is that the user needs to both define simplify and set has_simplify to true, but I think it is much simpler than the optional closure

May 08 '24 00:05 jayzhan211

The downside is that the user needs to both define simplify and set has_simplify to true, but I think it is much simpler than the optional closure

I personally try to avoid situations like this when creating API, this should be documented that both functions have to be implemented plus it introduces more corner cases.

Anyway, closure was a brain dump from yesterday, after a good sleep I don't think we need it, will try to elaborate in next comment

May 08 '24 07:05 milenkovicm

Option 1 - Original expression as a parameter

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know it'll always be AggregateFunction but we have to match on it 
        // which is not terribly hard but it makes api a bit odd 
        if let Expr::AggregateFunction(agg) = expr {
            ...
            Ok(datafusion_common::tree_node::Transformed::yes(...))
        } else {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
        }
    }

default implementation is just:

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
    }

Option 2 - AggregationFunction as a parameter

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know its aggregate function, no need to do extra matching, but we need to re-create expression 
        // issue user can return `no` with new expression in it 
        Ok(datafusion_common::tree_node::Transformed::no(Expr::AggregateFunction(agg)))
    }

Option 3 - AggregationFunction as a parameter with parametrised `ExprSimplifyResult`

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<ExprSimplifyResult<AggregateFunction>> {
       // user cant return `no` with new expression 
       // simplifier will re-crete expression in case original has been returned 
        Ok(ExprSimplifyResult::Original(agg))        
   }

IMHO, I prefer option number 3, which would involve parametrizing ExprSimplifyResult, second best is option number 2. wdyt @jayzhan211, @alamb

May 08 '24 08:05 milenkovicm

Option 1 - Original expression as a parameter

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know it'll always be AggregateFunction but we have to match on it 
        // which is not terribly hard but it makes api a bit odd 
        if let Expr::AggregateFunction(agg) = expr {
            ...
            Ok(datafusion_common::tree_node::Transformed::yes(...))
        } else {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
        }
    }

default implementation is just:

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
    }

Option 2 - AggregationFunction as a parameter

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know its aggregate function, no need to do extra matching, but we need to re-create expression 
        // issue user can return `no` with new expression in it 
        Ok(datafusion_common::tree_node::Transformed::no(Expr::AggregateFunction(agg)))
    }

Option 3 - AggregationFunction as a parameter with parametrised `ExprSimplifyResult`

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<ExprSimplifyResult<AggregateFunction>> {
       // user cant return `no` with new expression 
       // simplifier will re-crete expression in case original has been returned 
        Ok(ExprSimplifyResult::Original(agg))        
   }

IMHO, I prefer option number 3, which would involve parametrizing ExprSimplifyResult, second best is option number 2. wdyt @jayzhan211, @alamb

Given these 3, I prefer 2, because I think re-create expression is not a high cost that we should avoid. And we can do even further,

If user does not implement simplify, it is equivalent to Transformed::no or ExprSimplifyResult::Original, so we can just have Result<Expr>

fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<Expr> {
        Ok(Expr::AggregateFunction(agg))
    }

May 08 '24 09:05 jayzhan211

I apologise for making noise, but it looks like all of the non closure options have issue with borrowing:

error[E0505]: cannot move out of `expr.0` because it is borrowed
    --> datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:1391:48
     |
1388 |                 func_def: AggregateFunctionDefinition::UDF(ref udaf),
     |                                                            -------- borrow of `expr.0.func_def.0` occurs here
...
1391 |                 if let Expr::AggregateFunction(aggregate_function) = expr {
     |                                                ^^^^^^^^^^^^^^^^^^ move out of `expr.0` occurs here
1392 |                     udaf.simplify(aggregate_function, info)?
     |                     ---- borrow later used here

It looks like AggregateArg might be back on the table, which would involve cloning udf

May 08 '24 12:05 milenkovicm

@jayzhan211 and @alamb whenever you get chance please have a look, IMHO I find proposal with closure to tick all the boxes, but this one is definitely simpler.

May 09 '24 09:05 milenkovicm

@jayzhan211 and @alamb whenever you get chance please have a look, IMHO I find proposal with closure to tick all the boxes, but this one is definitely simpler.

After playing around, I feel like maybe optional closure is our answer 🤔

fn simplify(
        &self,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Expr>> {}

May 09 '24 12:05 jayzhan211

After playing around, I feel like maybe optional closure is our answer 🤔

Damn! I should have created a branch with that code 😀, if we have consensus on that direction I'll redo it

May 09 '24 12:05 milenkovicm

After playing around, I feel like maybe optional closure is our answer 🤔

Damn! I should have created a branch with that code 😀, if we have consensus on that direction I'll redo it

My final answer for today is

// UDAF
fn simplify(
        &self,
        args: AggregateArgs,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Result<Expr>>> {

// UDF, not necessary but for consistency
fn simplify(
        &self,
        args: Vec<Expr>,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(Vec<Expr>) -> Result<Expr>>> {

Go with optional closure!

May 09 '24 13:05 jayzhan211

My final answer for today is

fn simplify(
        &self,
        args: AggregateArgs,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Result<Expr>>> {

Go with optional closure!

you mean

// UDF
fn simplify(
        &self,
    ) -> Option<Box<dyn Fn(AggregateFunction, &dyn SimplifyInfo) -> Result<Expr>>> {

or

// UDF
fn simplify(
        &self,
        info: &dyn SimplifyInfo
    ) -> Option<Box<dyn Fn(AggregateFunction) -> Result<Expr>>> {

May 09 '24 13:05 milenkovicm

It should be this.

// UDF
fn simplify(
        &self,
    ) -> Option<Box<dyn Fn(AggregateFunction, &dyn SimplifyInfo) -> Result<Expr>>> {

The reason for optional closure is that I assume we need to return Expr for simplified cases. If we expect AggregateArgs, then we might not need closure.

May 09 '24 13:05 jayzhan211

We're on the same page @jayzhan211

May 09 '24 13:05 milenkovicm

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have

pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}

I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example.

@milenkovicm

May 09 '24 13:05 jayzhan211

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have
pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}
I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example.

@milenkovicm

I think we need Expr, this way we can replace function with something else

May 09 '24 13:05 milenkovicm

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have
pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}
I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example. @milenkovicm
I think we need Expr, this way we can replace function with something else

I agree.

May 09 '24 14:05 jayzhan211

You can define closure in datafusion/expr/src/function.rs

May 09 '24 23:05 jayzhan211

Thanks again @jayzhan211 and @milenkovicm 🙏

May 13 '24 11:05 alamb

Add simplify method to aggregate function

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Option 1 - Original expression as a parameter

Option 2 - AggregationFunction as a parameter

Option 3 - AggregationFunction as a parameter with parametrised ExprSimplifyResult

Option 1 - Original expression as a parameter

Option 2 - AggregationFunction as a parameter

Option 3 - AggregationFunction as a parameter with parametrised ExprSimplifyResult

Option 3 - AggregationFunction as a parameter with parametrised `ExprSimplifyResult`

Option 3 - AggregationFunction as a parameter with parametrised `ExprSimplifyResult`