Add simplify method to aggregate function
Which issue does this PR close?
Closes #9526.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
yes, added new test and example
Are there any user-facing changes?
no, changes are backward compatible
the change is quite similar to #9906 there are quite a lot optional/unused parameters in the method call, but i personally find it easier to understand as it is equivalent to AggregateFunction signature.
I wonder if ExprSimplifyResult::<T>::Original(T) where T can be something else than args (tuples included) would be a backward compatible change. This way we could send and return all required parameters without copy ... but this is discussion for different time
jsut thinking aloud, maybe if we change:
pub enum ExprSimplifyResult {
/// The function call was simplified to an entirely new Expr
Simplified(Expr),
/// the function call could not be simplified, and the arguments
/// are return unmodified.
Original(Vec<Expr>),
}
to:
pub enum ExprSimplifyResult<T> {
/// The function call was simplified to an entirely new Expr
Simplified(Expr),
/// the function call could not be simplified, and the arguments
/// are return unmodified.
Original(T),
}
simplify method would change from:
pub fn simplify(
&self,
args: Vec<Expr>,
distinct: &bool,
filter: &Option<Box<Expr>>,
order_by: &Option<Vec<Expr>>,
null_treatment: &Option<NullTreatment>,
info: &dyn SimplifyInfo,
) -> Result<ExprSimplifyResult> {
to:
pub fn simplify(
&self,
args: Vec<Expr>,
distinct: bool,
filter: Option<Box<Expr>>,
order_by: Option<Vec<Expr>>,
null_treatment: Option<NullTreatment>,
info: &dyn SimplifyInfo,
) -> Result<ExprSimplifyResult<(Vec<Expr>, bool, Option<Box<Expr>> ...)>> {
}
so in both cases, when we create replacement function or we re-assemble original function we would not need to clone parameters.
maybe hand full of changes, from -> Result<ExprSimplifyResult> to -> Result<ExprSimplifyResult<Vec<Expr>>>
Downside is that we would have to re-create expression if no simplification, plus change would be backward incompatible
pub enum ExprSimplifyResult<T> { /// The function call was simplified to an entirely new Expr Simplified(Expr), /// the function call could not be simplified, and the arguments /// are return unmodified. Original(T), }
I think it is a good idea.
jsut thinking aloud, maybe if we change:
I think if we are going to change the signature, I think we should use Transformed directly (and it is already parameterized)
Another option could be to make a simplify return optional closure
fn simplify(&self ) -> Option<Box<dyn Fn(AggregateFunction) -> datafusion_common::tree_node::Transformed<Expr>>> {
None
}
in this case expr_simplifier part would become:
Expr::AggregateFunction(AggregateFunction {func_def: AggregateFunctionDefinition::UDF(ref udaf), ..}) => {
match (udaf.simplify(), expr) {
(Some(simplify), Expr::AggregateFunction(a)) => simplify(a),
(_, e) => Transformed::no(e),
}
}
this solution:
- avoids cloning,
- avoids decomposition/composition of function if not needed
- avoids adding new struct
- but, it has dynamic dispatch simplify function call (in case simplify is defined)
- and a bit odd interface
other issues:
SimplifyInfois missing from signature, can be addedTransformed::nodoes not make sense in this context as original expression has not been propagated (should we use actual expression instead of AggregateFunction or removeTransformedfrom return type- closure should return result with or without
Transformed
Another option is to introduce has_simplify so we can simplify the code to
Expr::AggregateFunction(AggregateFunction {
func_def: AggregateFunctionDefinition::UDF(udaf),
args: AggregateFunctionArgs,
}) if udaf.has_simplify() => {
Transformed::yes(udaf.simplify()?)
}
fn simplify() -> Result<Expr>
The downside is that the user needs to both define simplify and set has_simplify to true, but I think it is much simpler than the optional closure
The downside is that the user needs to both define
simplifyand sethas_simplifyto true, but I think it is much simpler than the optional closure
I personally try to avoid situations like this when creating API, this should be documented that both functions have to be implemented plus it introduces more corner cases.
Anyway, closure was a brain dump from yesterday, after a good sleep I don't think we need it, will try to elaborate in next comment
Option 1 - Original expression as a parameter
fn simplify(
&self,
expr: Expr,
) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
// we know it'll always be AggregateFunction but we have to match on it
// which is not terribly hard but it makes api a bit odd
if let Expr::AggregateFunction(agg) = expr {
...
Ok(datafusion_common::tree_node::Transformed::yes(...))
} else {
// no re-creation of expression if not used
Ok(datafusion_common::tree_node::Transformed::no(expr))
}
}
default implementation is just:
fn simplify(
&self,
expr: Expr,
) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
// no re-creation of expression if not used
Ok(datafusion_common::tree_node::Transformed::no(expr))
}
Option 2 - AggregationFunction as a parameter
fn simplify(
&self,
agg: AggregateFunction,
) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
// we know its aggregate function, no need to do extra matching, but we need to re-create expression
// issue user can return `no` with new expression in it
Ok(datafusion_common::tree_node::Transformed::no(Expr::AggregateFunction(agg)))
}
Option 3 - AggregationFunction as a parameter with parametrised ExprSimplifyResult
fn simplify(
&self,
agg: AggregateFunction,
) -> Result<ExprSimplifyResult<AggregateFunction>> {
// user cant return `no` with new expression
// simplifier will re-crete expression in case original has been returned
Ok(ExprSimplifyResult::Original(agg))
}
IMHO, I prefer option number 3, which would involve parametrizing ExprSimplifyResult, second best is option number 2.
wdyt @jayzhan211, @alamb
Option 1 - Original expression as a parameter
fn simplify( &self, expr: Expr, ) -> Result<datafusion_common::tree_node::Transformed<Expr>> { // we know it'll always be AggregateFunction but we have to match on it // which is not terribly hard but it makes api a bit odd if let Expr::AggregateFunction(agg) = expr { ... Ok(datafusion_common::tree_node::Transformed::yes(...)) } else { // no re-creation of expression if not used Ok(datafusion_common::tree_node::Transformed::no(expr)) } }default implementation is just:
fn simplify( &self, expr: Expr, ) -> Result<datafusion_common::tree_node::Transformed<Expr>> { // no re-creation of expression if not used Ok(datafusion_common::tree_node::Transformed::no(expr)) }Option 2 - AggregationFunction as a parameter
fn simplify( &self, agg: AggregateFunction, ) -> Result<datafusion_common::tree_node::Transformed<Expr>> { // we know its aggregate function, no need to do extra matching, but we need to re-create expression // issue user can return `no` with new expression in it Ok(datafusion_common::tree_node::Transformed::no(Expr::AggregateFunction(agg))) }Option 3 - AggregationFunction as a parameter with parametrised
ExprSimplifyResultfn simplify( &self, agg: AggregateFunction, ) -> Result<ExprSimplifyResult<AggregateFunction>> { // user cant return `no` with new expression // simplifier will re-crete expression in case original has been returned Ok(ExprSimplifyResult::Original(agg)) }IMHO, I prefer option number 3, which would involve parametrizing
ExprSimplifyResult, second best is option number 2. wdyt @jayzhan211, @alamb
Given these 3, I prefer 2, because I think re-create expression is not a high cost that we should avoid. And we can do even further,
If user does not implement simplify, it is equivalent to Transformed::no or ExprSimplifyResult::Original, so we can just have Result<Expr>
fn simplify(
&self,
agg: AggregateFunction,
) -> Result<Expr> {
Ok(Expr::AggregateFunction(agg))
}
I apologise for making noise, but it looks like all of the non closure options have issue with borrowing:
error[E0505]: cannot move out of `expr.0` because it is borrowed
--> datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:1391:48
|
1388 | func_def: AggregateFunctionDefinition::UDF(ref udaf),
| -------- borrow of `expr.0.func_def.0` occurs here
...
1391 | if let Expr::AggregateFunction(aggregate_function) = expr {
| ^^^^^^^^^^^^^^^^^^ move out of `expr.0` occurs here
1392 | udaf.simplify(aggregate_function, info)?
| ---- borrow later used here
It looks like AggregateArg might be back on the table, which would involve cloning udf
@jayzhan211 and @alamb whenever you get chance please have a look, IMHO I find proposal with closure to tick all the boxes, but this one is definitely simpler.
@jayzhan211 and @alamb whenever you get chance please have a look, IMHO I find proposal with closure to tick all the boxes, but this one is definitely simpler.
After playing around, I feel like maybe optional closure is our answer 🤔
fn simplify(
&self,
_info: &dyn SimplifyInfo,
) -> Option<Box<dyn Fn(AggregateArgs) -> Expr>> {}
After playing around, I feel like maybe optional closure is our answer 🤔
Damn! I should have created a branch with that code 😀, if we have consensus on that direction I'll redo it
After playing around, I feel like maybe optional closure is our answer 🤔
Damn! I should have created a branch with that code 😀, if we have consensus on that direction I'll redo it
My final answer for today is
// UDAF
fn simplify(
&self,
args: AggregateArgs,
_info: &dyn SimplifyInfo,
) -> Option<Box<dyn Fn(AggregateArgs) -> Result<Expr>>> {
// UDF, not necessary but for consistency
fn simplify(
&self,
args: Vec<Expr>,
_info: &dyn SimplifyInfo,
) -> Option<Box<dyn Fn(Vec<Expr>) -> Result<Expr>>> {
Go with optional closure!
My final answer for today is
fn simplify( &self, args: AggregateArgs, _info: &dyn SimplifyInfo, ) -> Option<Box<dyn Fn(AggregateArgs) -> Result<Expr>>> {Go with optional closure!
you mean
// UDF
fn simplify(
&self,
) -> Option<Box<dyn Fn(AggregateFunction, &dyn SimplifyInfo) -> Result<Expr>>> {
or
// UDF
fn simplify(
&self,
info: &dyn SimplifyInfo
) -> Option<Box<dyn Fn(AggregateFunction) -> Result<Expr>>> {
It should be this.
// UDF
fn simplify(
&self,
) -> Option<Box<dyn Fn(AggregateFunction, &dyn SimplifyInfo) -> Result<Expr>>> {
The reason for optional closure is that I assume we need to return Expr for simplified cases. If we expect AggregateArgs, then we might not need closure.
We're on the same page @jayzhan211
We're on the same page @jayzhan211
If we don't need Expr for simplified UDAF, than we can have
pub fn simplify(
&self,
args: AggregateArgs,
info: &dyn SimplifyInfo,
) -> Result<Transformed<AggregateArgs>> {}
I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example.
@milenkovicm
We're on the same page @jayzhan211
If we don't need
Exprfor simplified UDAF, than we can havepub fn simplify( &self, args: AggregateArgs, info: &dyn SimplifyInfo, ) -> Result<Transformed<AggregateArgs>> {}I think the next problem is whether we need
Expror justAggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it inFuncitonWrite. You can takeArrayFunctionRewriterfor example.@milenkovicm
I think we need Expr, this way we can replace function with something else
We're on the same page @jayzhan211
If we don't need
Exprfor simplified UDAF, than we can havepub fn simplify( &self, args: AggregateArgs, info: &dyn SimplifyInfo, ) -> Result<Transformed<AggregateArgs>> {}I think the next problem is whether we need
Expror justAggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it inFuncitonWrite. You can takeArrayFunctionRewriterfor example. @milenkovicmI think we need Expr, this way we can replace function with something else
I agree.
You can define closure in datafusion/expr/src/function.rs
Thanks again @jayzhan211 and @milenkovicm 🙏