[EPIC] A collection of `Sort + Limit` / `Top K` optimizations
Is your feature request related to a problem or challenge?
This ticket links to a collection of ways to make queries with LIMIT, or other variants (such as row_number() predicates), both:
- Go faster
- Use less memory
These are typically called "Top K" style optimizations in databases, and they optimize the pattern of a sort followed by a limit:
LIMIT(fetch = 10)
SORT(x)
INPUT...
The observation is that if the INPUT is much larger than the fetch (aka the K), it is much more efficient and less memory intensive to track only the top 10 values than to sort the entire input and discard everything except the top 10.
Normally this is done with special ExecutionPlan operators. What the operators do and how they behave depends on the exact query pattern.
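As a rough illustration of the observation above, here is a minimal sketch of the general Top K technique using a bounded max-heap. It is not DataFusion's implementation: the function name `top_k_smallest` and the `i64` value type are assumptions made for this example, and a real operator would work on Arrow record batches and arbitrary sort expressions. The point is only that memory stays proportional to K rather than to the input size.

```rust
use std::collections::BinaryHeap;

/// Hypothetical helper (not a DataFusion API): returns the `k` smallest
/// values of `input` in ascending order, i.e. the result of
/// `ORDER BY x ASC LIMIT k`, while holding at most `k` values in memory.
fn top_k_smallest<I>(input: I, k: usize) -> Vec<i64>
where
    I: IntoIterator<Item = i64>,
{
    if k == 0 {
        return Vec::new();
    }

    // Max-heap: the root is the largest of the (at most k) values kept so far,
    // which is the first candidate to be evicted.
    let mut heap: BinaryHeap<i64> = BinaryHeap::with_capacity(k);

    for value in input {
        if heap.len() < k {
            // Still filling the heap: keep everything.
            heap.push(value);
        } else if let Some(mut largest) = heap.peek_mut() {
            // Heap is full: keep `value` only if it beats the current worst.
            if value < *largest {
                // Overwrite the root; the heap restores its ordering
                // when the `PeekMut` guard is dropped.
                *largest = value;
            }
        }
    }

    // Consumes the heap and returns its contents in ascending order.
    heap.into_sorted_vec()
}
```

For example, `top_k_smallest(vec![5, 1, 9, 3, 7, 2], 3)` returns `[1, 2, 3]`. Each input row costs at most O(log K) work, compared with sorting the full input and then discarding all but K rows.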
Describe the solution you'd like
- [x] https://github.com/apache/arrow-datafusion/issues/7196
- [x] https://github.com/apache/arrow-datafusion/issues/7149
- [ ] https://github.com/apache/arrow-datafusion/issues/6937
- [x] https://github.com/apache/arrow-datafusion/issues/7198
- [x] https://github.com/apache/arrow-datafusion/issues/7064
- [ ] https://github.com/apache/arrow-datafusion/issues/6899
- [x] https://github.com/apache/arrow-datafusion/issues/7191
- [ ] https://github.com/apache/arrow-datafusion/issues/2365
- [x] https://github.com/apache/arrow-datafusion/issues/7162
- [ ] https://github.com/apache/arrow-datafusion/issues/3579
- [ ] https://github.com/apache/arrow-datafusion/issues/900
Describe alternatives you've considered
No response
Additional context
No response