
[EPIC] A collection of `Sort + Limit` / `Top K` optimizations

Open alamb opened this issue 2 years ago • 0 comments

Is your feature request related to a problem or challenge?

This ticket links to a collection of ways to make queries with LIMIT, or other variants (like row_number() predicates), both:

  1. Go faster
  2. Use less memory

These are typically called "Top K" style optimizations in databases, and they optimize the pattern of a sort followed by a limit:

LIMIT(fetch = 10)
  SORT(x)
    INPUT...

The observation is that if the INPUT is much larger than the fetch (aka the K), it is much more efficient and less memory intensive to track only the top 10 values as the input streams by than to sort the entire input and discard everything except the top 10.
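To make the observation concrete, here is a minimal sketch of the idea using a bounded max-heap from the Rust standard library. This is not DataFusion's actual operator: the function name `top_k_smallest` and the use of plain `i64` values are assumptions made for illustration only. The heap never holds more than K values, so memory is proportional to K rather than to the size of the input.

```rust
use std::collections::BinaryHeap;

/// Hypothetical helper (not a DataFusion API): return the `k` smallest
/// values of `input` in ascending order, holding at most `k` values in
/// memory at any time.
fn top_k_smallest<I>(input: I, k: usize) -> Vec<i64>
where
    I: IntoIterator<Item = i64>,
{
    if k == 0 {
        return Vec::new();
    }
    // Max-heap: the root is the largest of the k smallest values seen so far,
    // i.e. the current "worst" value that is still in the top k.
    let mut heap: BinaryHeap<i64> = BinaryHeap::with_capacity(k);
    for value in input {
        if heap.len() < k {
            // Still filling the heap: keep everything.
            heap.push(value);
        } else if let Some(mut worst) = heap.peek_mut() {
            // Heap is full: replace the worst kept value only if the new
            // value beats it. `PeekMut` restores the heap order on drop.
            if value < *worst {
                *worst = value;
            }
        }
    }
    // Emit the surviving k values in sorted (ascending) order.
    heap.into_sorted_vec()
}
```

For N input rows this does O(N log K) work and keeps O(K) values, compared with O(N log N) work and O(N) memory for a full sort followed by a limit. A real operator would also need to handle multi-column sort keys, nulls, and sort direction, which this sketch leaves out.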

Normally this is done with special ExecutionPlan operators. What the operators do and how they behave depends on the exact query pattern.

Describe the solution you'd like

  • [x] https://github.com/apache/arrow-datafusion/issues/7196
  • [x] https://github.com/apache/arrow-datafusion/issues/7149
  • [ ] https://github.com/apache/arrow-datafusion/issues/6937
  • [x] https://github.com/apache/arrow-datafusion/issues/7198
  • [x] https://github.com/apache/arrow-datafusion/issues/7064
  • [ ] https://github.com/apache/arrow-datafusion/issues/6899
  • [x] https://github.com/apache/arrow-datafusion/issues/7191
  • [ ] https://github.com/apache/arrow-datafusion/issues/2365
  • [x] https://github.com/apache/arrow-datafusion/issues/7162
  • [ ] https://github.com/apache/arrow-datafusion/issues/3579
  • [ ] https://github.com/apache/arrow-datafusion/issues/900

Describe alternatives you've considered

No response

Additional context

No response

alamb, Aug 04 '23 18:08