[Epic] Unify `AggregateFunction` Interface (remove built in list of `AggregateFunction` s), improve the system
Is your feature request related to a problem or challenge?
For many of the same reasons as listed on https://github.com/apache/arrow-datafusion/issues/8045, having two types of aggregate functions ("built in" -- datafusion::physical_plan::aggregates::AggregateFunction) and AggregateUDF is problematic for two reasons:
- There are some features not available to User Defined Aggregate Functions (such as the faster
GroupsAccumulatorinterface) - Users can not easily choose which aggregate functions to include (for example, do they want to allow (and pay the code size / compile time) for the Statistical and Approximate functions
The second also ends up causing pushback on adding new aggregates like ARRAY_SUM in https://github.com/apache/arrow-datafusion/pull/8325 and geospatial support https://github.com/apache/arrow-datafusion/issues/7859.
Describe the solution you'd like
I propose moving DataFusion to only use AggregateUDFs and remove the built in list of AggregateFunctions for the same reasons as https://github.com/apache/arrow-datafusion/issues/8045
We will keep the existing AggregateUDF interface as much as possible, while also potentially providing an easier way to define them.
New AggregateUDF is in functions-aggregate crate
Old Aggregate functions are in datafusion/physical-expr/src/aggregate
Describe alternatives you've considered
Additional context
Proposed implementation steps:
- #10695
- https://github.com/apache/arrow-datafusion/issues/8710
- https://github.com/apache/arrow-datafusion/issues/8793
- https://github.com/apache/arrow-datafusion/issues/8984
- #9926
- Simplify API #9526
- Covariance Sample #10372
- Covariance Population #10418
- Count #10484
- Median #10644
- First and Last #9874 #10648
- Sum #10651
- Expression API #10560
- Variance Sample #10667
- Variance Population #10668
- Stddev #10827
- Approx Distinct #10851
- approx_median #10838
- approx_percentile_cont approx_percentile_cont_with_weight #10870
- regr #10883
- correlation #10884
- BitAnd / Or / Xor #10907
- Average #10942
- Min Max #10943
- String agg #10945
- ArrayAgg #10999
- BoolAndOR #11008
- https://github.com/apache/datafusion/issues/10943
- https://github.com/apache/datafusion/issues/11151
Move rust test to sqllogictest if possible #10384
Good first issue list
- grouping #10906
Pending
- nth_value
- array_agg_distinct
- array_agg_ordered
impl FromStr for AggregateFunction {
type Err = DataFusionError;
fn from_str(name: &str) -> Result<AggregateFunction> {
Ok(match name {
// general
"avg" => AggregateFunction::Avg,
"bit_and" => AggregateFunction::BitAnd,
"bit_or" => AggregateFunction::BitOr,
"bit_xor" => AggregateFunction::BitXor,
"bool_and" => AggregateFunction::BoolAnd,
"bool_or" => AggregateFunction::BoolOr,
"max" => AggregateFunction::Max,
"mean" => AggregateFunction::Avg,
"min" => AggregateFunction::Min,
"array_agg" => AggregateFunction::ArrayAgg,
"nth_value" => AggregateFunction::NthValue,
"string_agg" => AggregateFunction::StringAgg,
// statistical
"corr" => AggregateFunction::Correlation,
// other
"grouping" => AggregateFunction::Grouping,
_ => {
return plan_err!("There is no built-in function named {name}");
}
})
}
}
Feel free to file an issue if you are interested in working on any of the above in the pending list.
I will work on FirstValue UDF together with #9249
Can we introduce state_fields and fields for AggregateUDFImpl. We can see that types in AggregateUDFImpl are for building fields, why not just return fields directly, we can not only define the types but also the field name and is_nullalbe.
https://github.com/apache/arrow-datafusion/blob/e337832946fc69f6ceccd14b96a071cc2cd4693d/datafusion/physical-plan/src/udaf.rs#L86-L106
Thank you @jayzhan211 -- I am traveling this week so I am very behind on reviews. I will try and respond later this week
Thank you @jayzhan211 -- I am traveling this week so I am very behind on reviews. I will try and respond later this week
I leave the comment so I won't forget. Have a good trip 😊
I have another question. I think our goal for the aggregate function is similar to functions. we will move them into separate crate. If we need to avoid importing physical-expr. It seems we need to move some struct from physical-expr to expr, like PhysicalExpr and pub type LexOrdering = Vec<PhysicalSortExpr>;
https://github.com/apache/arrow-datafusion/blob/3eeb108125b35424baac39dd20ba88433b347419/datafusion/physical-expr/src/aggregate/first_last.rs#L44-L45
But, does moving PhysicalExpr to datafusion_expr make sense? 😕 Where/When should convert expr to physical-expr like sort_exprs to LexOrdering. Should we go through the whole process in the aggregate-functions crate like what functions do or we should separate logical-expr and physical-expr for aggregate-functions and find a way to link between them (convert from logical-expr to physical-expr)?
After https://github.com/apache/datafusion/pull/10648 and https://github.com/apache/datafusion/issues/10389 I think we have a pretty good set of examples of how to move aggregates out of the core (thanks to all the foundations layed by @jayzhan211 )
Would it be helpful to file a few "good first issue" type tickets for some of the more straightforward aggregates (I am thinking the statistical variance etc)?
After #10648 and #10389 I think we have a pretty good set of examples of how to move aggregates out of the core (thanks to all the foundations layed by @jayzhan211 )
Would it be helpful to file a few "good first issue" type tickets for some of the more straightforward aggregates (I am thinking the statistical variance etc)?
Before this, I think it would be nice to determine the expression API of aggregate function first https://github.com/apache/datafusion/pull/10560#discussion_r1611644593