datafusion
datafusion copied to clipboard
Follow-on work for `median` aggregate function
Is your feature request related to a problem or challenge? Please describe what you are trying to do. Follow-on issue to address feedback in https://github.com/apache/arrow-datafusion/pull/3009
In particular:
- Run some performance test and see if it would be better to use builders in the aggregate state rather than arrays. Should we just use
ScalarValue::Listinstead? - Should we have a configurable limit on the number of values that can be collected to prevent OOM?
- Can we leverage the
concatandtakekernels? - Should we add an
averagekernel to arrow-rs? - Can we reduce memory overhead with some form of compression (such as run-length encoding or maintaining a map of unique values with counts)
Describe the solution you'd like See above
Describe alternatives you've considered None
Additional context None
Another optimization is that you do not need to to sort all values, only the value at the middle index needs to be at the right place. Vec::select_nth_unstable does exactly that in O(n) using a quickselect algorithm.
#7376 solves most of these