datafusion
datafusion copied to clipboard
Remove special casting of `Min` / `Max` built in `AggregateFunctions`
Is your feature request related to a problem or challenge?
Part of https://github.com/apache/datafusion/issues/10943
While trying to port min and max to UDFs in https://github.com/apache/datafusion/pull/11013, @edmondop found several places where Min and Max (the existing built in aggregate functions) are special cased
Here is Max:
https://github.com/apache/datafusion/blob/c2ea6b34aaf603737f42753123662061b86c1ee5/datafusion/physical-expr/src/aggregate/min_max.rs#L77-L82
The problem with relying on Min and Max directly is that it is hard/impossible to switch Min/Max to be a user defined aggregates as seen in https://github.com/apache/datafusion/pull/11013
Also, relying on Min/Max directly means that there is certain behavior that is not available to UDAFs compared to build in aggregate functions, which isn't ideal
Describe the solution you'd like
Remove all explicit references to Min / Max
Specifically, code that looks like this should be removed
expr.as_any().downcast_ref::<Max>()
and
expr.as_any().downcast_ref::<Min>()
Describe alternatives you've considered
Ideally, the alternative is to use a function on AggregateExpr which we can then add/implement for UDAFs when we port min/max to be a UDAF
Task List
- [x] https://github.com/apache/datafusion/issues/11152
- [ ] https://github.com/apache/datafusion/issues/11153
Additional context
No response
Do you think we should implement a special is_min/is_max since these are pervasive and used also for statistics ?
Do you think we should implement a special is_min/is_max since these are pervasive and used also for statistics ?
It might make sense to special case is_min and is_max
What we did in the case of ScalarUDFs was to go with functions that described the property in question was rather than what the function was. So for example, for coalesce and some array functions which needed special case type coercion rules, @jayzhan211 added a way to provide user defined coercion rules, rather than a is_coalesce function
I was thinking we would do something similar in this case. However, it if turns out that all such properties are only relevant for min/max maybe having a is_min / is_max API would be better than enumerating all the properties
@alamb I think this has been solved?
This is not, we need to address the specialization check in aggregate_statistic
Specifically I think these checks need to be removed: https://github.com/apache/datafusion/blob/4838cfbf453f3c21d9c5a84f9577329dd78aa763/datafusion/physical-optimizer/src/aggregate_statistics.rs#L269-L292
Here is one specific suggestion: https://github.com/apache/datafusion/pull/12296#discussion_r1747254563