datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Follow-on work for `median` aggregate function

Open andygrove opened this issue 3 years ago • 1 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do. Follow-on issue to address feedback in https://github.com/apache/arrow-datafusion/pull/3009

In particular:

  • Run some performance test and see if it would be better to use builders in the aggregate state rather than arrays. Should we just use ScalarValue::List instead?
  • Should we have a configurable limit on the number of values that can be collected to prevent OOM?
  • Can we leverage the concat and take kernels?
  • Should we add an average kernel to arrow-rs?
  • Can we reduce memory overhead with some form of compression (such as run-length encoding or maintaining a map of unique values with counts)

Describe the solution you'd like See above

Describe alternatives you've considered None

Additional context None

andygrove avatar Aug 05 '22 13:08 andygrove

Another optimization is that you do not need to to sort all values, only the value at the middle index needs to be at the right place. Vec::select_nth_unstable does exactly that in O(n) using a quickselect algorithm.

jhorstmann avatar Aug 09 '22 14:08 jhorstmann

#7376 solves most of these

Dandandan avatar Aug 23 '23 09:08 Dandandan