datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Remove special casting of `Min` / `Max` built in `AggregateFunctions`

Open alamb opened this issue 1 year ago • 4 comments

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/10943

While trying to port min and max to UDFs in https://github.com/apache/datafusion/pull/11013, @edmondop found several places where Min and Max (the existing built in aggregate functions) are special cased

Here is Max: https://github.com/apache/datafusion/blob/c2ea6b34aaf603737f42753123662061b86c1ee5/datafusion/physical-expr/src/aggregate/min_max.rs#L77-L82

The problem with relying on Min and Max directly is that it is hard/impossible to switch Min/Max to be a user defined aggregates as seen in https://github.com/apache/datafusion/pull/11013

Also, relying on Min/Max directly means that there is certain behavior that is not available to UDAFs compared to build in aggregate functions, which isn't ideal

Describe the solution you'd like

Remove all explicit references to Min / Max

Specifically, code that looks like this should be removed

       expr.as_any().downcast_ref::<Max>()

and

       expr.as_any().downcast_ref::<Min>()

Describe alternatives you've considered

Ideally, the alternative is to use a function on AggregateExpr which we can then add/implement for UDAFs when we port min/max to be a UDAF

Task List

  • [x] https://github.com/apache/datafusion/issues/11152
  • [ ] https://github.com/apache/datafusion/issues/11153

Additional context

No response

alamb avatar Jun 27 '24 23:06 alamb

Do you think we should implement a special is_min/is_max since these are pervasive and used also for statistics ?

edmondop avatar Jun 29 '24 13:06 edmondop

Do you think we should implement a special is_min/is_max since these are pervasive and used also for statistics ?

It might make sense to special case is_min and is_max

What we did in the case of ScalarUDFs was to go with functions that described the property in question was rather than what the function was. So for example, for coalesce and some array functions which needed special case type coercion rules, @jayzhan211 added a way to provide user defined coercion rules, rather than a is_coalesce function

I was thinking we would do something similar in this case. However, it if turns out that all such properties are only relevant for min/max maybe having a is_min / is_max API would be better than enumerating all the properties

alamb avatar Jun 30 '24 10:06 alamb

@alamb I think this has been solved?

edmondop avatar Aug 28 '24 17:08 edmondop

This is not, we need to address the specialization check in aggregate_statistic

jayzhan211 avatar Aug 28 '24 23:08 jayzhan211

Specifically I think these checks need to be removed: https://github.com/apache/datafusion/blob/4838cfbf453f3c21d9c5a84f9577329dd78aa763/datafusion/physical-optimizer/src/aggregate_statistics.rs#L269-L292

alamb avatar Sep 02 '24 11:09 alamb

Here is one specific suggestion: https://github.com/apache/datafusion/pull/12296#discussion_r1747254563

alamb avatar Sep 06 '24 14:09 alamb