Implement spilling for PartialSortExec

Open alamb opened this issue 1 year ago • 2 comments

Is your feature request related to a problem or challenge?

PartialSortExec was added in https://github.com/apache/arrow-datafusion/issues/7456 / https://github.com/apache/arrow-datafusion/pull/9125

While one of the major benefits of this operator is that it reduces the memory required when sorting data (as it can emit early), we should also handle the case where it still cannot fit everything in memory.

Describe the solution you'd like

Add spilling support to PartialSortExec so that, if it runs out of memory, it spills to disk rather than returning an error.
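To make the requested behavior concrete, here is a minimal, std-only Rust sketch of the idea. It is not DataFusion code and does not use any DataFusion API: `PartialSorter`, `max_buffered_rows`, and the other names are hypothetical, a row count stands in for a real memory reservation, and "spilled" runs are kept in a `Vec` where a real operator would write them to disk. It assumes the input is already sorted on the first tuple field (the prefix) and only needs sorting on the second.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// (prefix key, suffix key): input arrives sorted on the prefix only.
type Row = (u32, u32);

/// Toy stand-in for a partial sort with spilling; all names are hypothetical.
struct PartialSorter {
    max_buffered_rows: usize, // stand-in for a memory budget
    buffer: Vec<Row>,         // rows of the current prefix group held "in memory"
    spills: Vec<Vec<Row>>,    // sorted runs; a real operator would write these to disk
    spill_count: usize,
}

impl PartialSorter {
    fn new(max_buffered_rows: usize) -> Self {
        Self { max_buffered_rows, buffer: Vec::new(), spills: Vec::new(), spill_count: 0 }
    }

    /// Sort the buffered rows and move them out as one sorted run.
    fn spill(&mut self) {
        self.buffer.sort();
        self.spills.push(std::mem::take(&mut self.buffer));
        self.spill_count += 1;
    }

    /// K-way merge of all sorted runs of the finished group, appended to `out`.
    fn finish_group(&mut self, out: &mut Vec<Row>) {
        self.buffer.sort();
        let mut runs = std::mem::take(&mut self.spills);
        runs.push(std::mem::take(&mut self.buffer));
        // Min-heap of (row, run index, position within that run).
        let mut heap = BinaryHeap::new();
        for (i, run) in runs.iter().enumerate() {
            if let Some(&row) = run.first() {
                heap.push(Reverse((row, i, 0usize)));
            }
        }
        while let Some(Reverse((row, i, pos))) = heap.pop() {
            out.push(row);
            if let Some(&next) = runs[i].get(pos + 1) {
                heap.push(Reverse((next, i, pos + 1)));
            }
        }
    }

    /// Consume prefix-sorted input; return rows sorted on (prefix, suffix).
    fn sort(&mut self, input: &[Row]) -> Vec<Row> {
        let mut out = Vec::with_capacity(input.len());
        let mut current: Option<u32> = None;
        for &row in input {
            if current.is_some() && current != Some(row.0) {
                // Prefix changed: the previous group is complete and can be emitted early.
                self.finish_group(&mut out);
            }
            current = Some(row.0);
            if self.buffer.len() >= self.max_buffered_rows {
                // Over budget: spill a sorted run instead of returning an error.
                self.spill();
            }
            self.buffer.push(row);
        }
        self.finish_group(&mut out);
        out
    }
}
```

With a budget of two rows, sorting `[(1, 3), (1, 1), (1, 2), (2, 9), (2, 7)]` spills once (when the third row of prefix `1` arrives) and yields `[(1, 1), (1, 2), (1, 3), (2, 7), (2, 9)]`. The point of the sketch is that only a single prefix group ever needs to be buffered or spilled, so spilling is a fallback for unusually large groups rather than the common path.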

Describe alternatives you've considered

No response

Additional context

https://github.com/apache/arrow-datafusion/issues/9153 tracks enabling PartialSort for more queries

alamb avatar Feb 09 '24 01:02 alamb

I'd like to help with this. It doesn't look like a small project, but there is a spilling implementation in SortExec that I can learn from.

yyy1000 avatar Feb 15 '24 16:02 yyy1000

Thanks @yyy1000 -- I would definitely recommend

  1. Studying the existing implementation in Sort
  2. Creating a test case that shows the sort being invoked (i.e., set the memory manager limit low and create a partial sort plan)
  3. Trying to refactor / adapt the parts used in sort so they can also be used in partial sort

alamb avatar Feb 16 '24 13:02 alamb