datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Unbounded SortExec (and Top-K) Implementation When Req's Are Satisfied

Open berkaysynnada opened this issue 1 year ago • 3 comments

Which issue does this PR close?

Closes #.

Rationale for this change

SortExec (with or w/out fetch) can work without an actual sort if the existing input order is required.

What changes are included in this PR?

  1. When we create a new SortExec, its execution mode is set according to its input order.
  2. Adding a fetch can update its execution mode.
  3. During execution, we consult the information of matching the requirements and having a fetch. If we already satisfy the requirements, SortExec either short-circuits itself (if it does not have a fetch), or behaves as a limit operator (if it has a fetch).

Are these changes tested?

Yes. I cannot practice the newly added stream types via an .slt test, but a unit test is added.

Are there any user-facing changes?

berkaysynnada avatar Aug 26 '24 13:08 berkaysynnada

But in this case I would have expected the Sort to have been removed by one of the optimizer passes rather than the SortExec implementing a limit. 🤔

Right. The sort will actually be entirely removed in later passes. However, in the meantime (before its removal), any rule looking at the execution mode will still think this TopK is pipeline-breaking (while it is not), which results in them behaving incorrectly. We caught this behavior downstream in the context of a custom rule.

Basically, this PR does two things:

  1. It makes the execution_mode API give a correct result when data is already sorted, so intermediate rules that look at execution modes do not falsely think that the pipeline breaks somewhere.
  2. It makes TopK's execution logic behave like a limit if it somehow ends up in the plan without being removed. This is not possible in upstream DataFusion (sans bugs), but people using custom optimizers can end up in such a state. In such cases, it will also execute more efficiently.

ozankabak avatar Aug 28 '24 12:08 ozankabak

Thanks for taking another look -- incorporated your feedback to leverage LimitStream

ozankabak avatar Aug 28 '24 17:08 ozankabak

I will go ahead and merge this soon since it is a small, localized change that improves corner cases without interfering with the main use case of SortExec.

If there are any lingering concerns, we will address with a quick follow-on PR

ozankabak avatar Aug 29 '24 05:08 ozankabak

Thank you for responding to the feedback @berkaysynnada and @ozankabak -- sorry for my delay -- I have been out. This PR now looks good to me

alamb avatar Sep 01 '24 12:09 alamb