Is your feature request related to a problem or challenge?

For many of the same reasons as listed on https://github.com/apache/arrow-datafusion/issues/8045, having two types of aggregate functions ("built in" -- datafusion::physical_plan::aggregates::AggregateFunction) and AggregateUDF is problematic for two reasons:

There are some features not available to User Defined Aggregate Functions (such as the faster GroupsAccumulator interface)
Users can not easily choose which aggregate functions to include (for example, do they want to allow (and pay the code size / compile time) for the Statistical and Approximate functions

The second also ends up causing pushback on adding new aggregates like ARRAY_SUM in https://github.com/apache/arrow-datafusion/pull/8325 and geospatial support https://github.com/apache/arrow-datafusion/issues/7859.

Describe the solution you'd like

I propose moving DataFusion to only use AggregateUDFs and remove the built in list of AggregateFunctions for the same reasons as https://github.com/apache/arrow-datafusion/issues/8045

We will keep the existing AggregateUDF interface as much as possible, while also potentially providing an easier way to define them.

New AggregateUDF is in functions-aggregate crate Old Aggregate functions are in datafusion/physical-expr/src/aggregate

Describe alternatives you've considered

Additional context

Proposed implementation steps:

#10695
https://github.com/apache/arrow-datafusion/issues/8710
https://github.com/apache/arrow-datafusion/issues/8793
https://github.com/apache/arrow-datafusion/issues/8984
#9926
Simplify API #9526
Covariance Sample #10372
Covariance Population #10418
Count #10484
Median #10644
First and Last #9874 #10648
Sum #10651
Expression API #10560
Variance Sample #10667
Variance Population #10668
Stddev #10827
Approx Distinct #10851
approx_median #10838
approx_percentile_cont approx_percentile_cont_with_weight #10870
regr #10883
correlation #10884
BitAnd / Or / Xor #10907
Average #10942
Min Max #10943
String agg #10945
ArrayAgg #10999
BoolAndOR #11008
https://github.com/apache/datafusion/issues/10943
https://github.com/apache/datafusion/issues/11151

Move rust test to sqllogictest if possible #10384

Good first issue list

grouping #10906

Pending

nth_value
array_agg_distinct
array_agg_ordered

impl FromStr for AggregateFunction {
    type Err = DataFusionError;
    fn from_str(name: &str) -> Result<AggregateFunction> {
        Ok(match name {
            // general
            "avg" => AggregateFunction::Avg,
            "bit_and" => AggregateFunction::BitAnd,
            "bit_or" => AggregateFunction::BitOr,
            "bit_xor" => AggregateFunction::BitXor,
            "bool_and" => AggregateFunction::BoolAnd,
            "bool_or" => AggregateFunction::BoolOr,
            "max" => AggregateFunction::Max,
            "mean" => AggregateFunction::Avg,
            "min" => AggregateFunction::Min,
            "array_agg" => AggregateFunction::ArrayAgg,
            "nth_value" => AggregateFunction::NthValue,
            "string_agg" => AggregateFunction::StringAgg,
            // statistical
            "corr" => AggregateFunction::Correlation,
            // other
            "grouping" => AggregateFunction::Grouping,
            _ => {
                return plan_err!("There is no built-in function named {name}");
            }
        })
    }
}

Feel free to file an issue if you are interested in working on any of the above in the pending list.

Jan 01 '24 18:01 alamb

I will work on FirstValue UDF together with #9249

Mar 26 '24 01:03 jayzhan211

Can we introduce state_fields and fields for AggregateUDFImpl. We can see that types in AggregateUDFImpl are for building fields, why not just return fields directly, we can not only define the types but also the field name and is_nullalbe.

https://github.com/apache/arrow-datafusion/blob/e337832946fc69f6ceccd14b96a071cc2cd4693d/datafusion/physical-plan/src/udaf.rs#L86-L106

Mar 27 '24 00:03 jayzhan211

Thank you @jayzhan211 -- I am traveling this week so I am very behind on reviews. I will try and respond later this week

Mar 27 '24 02:03 alamb

Thank you @jayzhan211 -- I am traveling this week so I am very behind on reviews. I will try and respond later this week

I leave the comment so I won't forget. Have a good trip 😊

Mar 27 '24 04:03 jayzhan211

I have another question. I think our goal for the aggregate function is similar to functions. we will move them into separate crate. If we need to avoid importing physical-expr. It seems we need to move some struct from physical-expr to expr, like PhysicalExpr and pub type LexOrdering = Vec<PhysicalSortExpr>;

https://github.com/apache/arrow-datafusion/blob/3eeb108125b35424baac39dd20ba88433b347419/datafusion/physical-expr/src/aggregate/first_last.rs#L44-L45

But, does moving PhysicalExpr to datafusion_expr make sense? 😕 Where/When should convert expr to physical-expr like sort_exprs to LexOrdering. Should we go through the whole process in the aggregate-functions crate like what functions do or we should separate logical-expr and physical-expr for aggregate-functions and find a way to link between them (convert from logical-expr to physical-expr)?

Mar 29 '24 13:03 jayzhan211

After https://github.com/apache/datafusion/pull/10648 and https://github.com/apache/datafusion/issues/10389 I think we have a pretty good set of examples of how to move aggregates out of the core (thanks to all the foundations layed by @jayzhan211 )

Would it be helpful to file a few "good first issue" type tickets for some of the more straightforward aggregates (I am thinking the statistical variance etc)?

May 25 '24 09:05 alamb

After #10648 and #10389 I think we have a pretty good set of examples of how to move aggregates out of the core (thanks to all the foundations layed by @jayzhan211 )

Would it be helpful to file a few "good first issue" type tickets for some of the more straightforward aggregates (I am thinking the statistical variance etc)?

Before this, I think it would be nice to determine the expression API of aggregate function first https://github.com/apache/datafusion/pull/10560#discussion_r1611644593

May 25 '24 13:05 jayzhan211

[Epic] Unify `AggregateFunction` Interface (remove built in list of `AggregateFunction` s), improve the system

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Good first issue list

Pending