datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Add min_by and max_by aggregate functions

Open timsaucer opened this issue 1 year ago • 8 comments

Is your feature request related to a problem or challenge?

It is a common need to get the value of one column such that another column is a minimum. For example, if I have a column of fruit_name and price_per_pound I might want to get the fruit_name for which price_per_pound is a minimum. This can be done with existing functions, but it should be both more performant and more user friendly to add these functions.

Describe the solution you'd like

Ideally this would take two expressions. The first would be the expression you want to return and the second would be the expression that we are looking for the min/max of. In my example we would do something like

min_by(col("fruit_name"), col("price_per_pound"))

Describe alternatives you've considered

Right now I would probably do a first_value with an order_by. This will introduce an unnecessary sort of the dataframe.

Additional context

Example from

https://docs.databricks.com/en/sql/language-manual/functions/min_by.html

timsaucer avatar Aug 20 '24 10:08 timsaucer

We have similar special functions for this type of function in InfluxDB (needed to get the max by time, for example) in case that helps this

We call them selector functions. Here are the docs https://docs.influxdata.com/influxdb/cloud-serverless/reference/sql/functions/selector/

Here is the code https://github.com/influxdata/influxdb3_core/blob/main/query_functions/src/selectors/internal.rs

alamb avatar Aug 22 '24 15:08 alamb

There are already aggregation variants of first/last which seem to solve the issue (example), and, at first glance, they do not perform normal sorting, only compare incoming ordering column values with "accumulated" ones.

It also seems that their performance could be improved by implementing GroupsAccumulator for them (or for majority of input types at least), as their state is somehow similar to AVGs state in terms of comlexity.

korowa avatar Aug 25 '24 10:08 korowa

FIRST_VALUE / LAST_VALUE with ORDER BY is a great call. Here is the code if anyone is interested:

https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/nth_value.rs

alamb avatar Aug 25 '24 10:08 alamb

More like https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/first_last.rs (I suppose)

korowa avatar Aug 25 '24 11:08 korowa

My bad, it still sorts input data (during e.g. get_first_idx, which is likely a redundant sort, except for the cases when input is already sorted by ordering columns). That is another thing to improve.

korowa avatar Aug 25 '24 11:08 korowa

As I'm thinking about it, I'm not sure you can get around doing the sort since you have an arbitrary number of ordering clauses. I think what you've proposed is the best option. One could speed up the operation when there's only a single order clause but that may not be worth the effort involved.

I'm okay closing this issue unless anyone thinks there's value in pursuing it further.

timsaucer avatar Aug 25 '24 12:08 timsaucer

As I'm thinking about it, I'm not sure you can get around doing the sort since you have an arbitrary number of ordering clauses. I think what you've proposed is the best option. One could speed up the operation when there's only a single order clause but that may not be worth the effort involved.

FWIW this is what the first_selector / last_selector functions do in InfluxDB -- they basically maintain only the value for the largest ORDER BY column, rather than actually sorting the entire intermediate result

I'm okay closing this issue unless anyone thinks there's value in pursuing it further.

It would be nice to make it clearer how to get the equivalent of min_by and max_by -- maybe we could just document the equivalent first_value(... ORDER BY ...) queries 🤔 or we could automatically rewrite min_by / max_by into first_value() / last_value 🤔

It is probably also worth filing a ticket for the potential optimization

alamb avatar Aug 25 '24 13:08 alamb

I guess user guide will be the best option since naming for this function may vary in different DBs -- min_by / argmin / argmin_agg, and maybe couple of others, so aliasing is not that straightforward.

korowa avatar Aug 25 '24 16:08 korowa

I think we can close this ticket

dharanad avatar Sep 27 '24 15:09 dharanad

Thanks everyone

alamb avatar Sep 28 '24 16:09 alamb