Fix: spill OOM in sort
Which issue does this PR close?
- Closes #19013 (https://github.com/apache/datafusion/issues/19013)
Rationale for this change
When multiple operators reserve memory from the same memory pool, contention between them can cause memory allocations to fail.
What changes are included in this PR?
The sort operator now retries try_grow several times before giving up. RowCursorStream cannot spill to disk, so when the memory pool is full its reservation is grown directly instead of returning an error. Other operators that do support spilling are expected to spill afterward and release memory (this assumes some spare memory is still available).
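To illustrate the retry-then-grow idea, here is a minimal, std-only sketch. It is not the code in this PR and does not use DataFusion's MemoryPool API: `ToyPool`, `reserve_with_retry`, `max_attempts`, `backoff`, and `can_spill` are all hypothetical names chosen for the example. It assumes a consumer that can spill should get the error back after the retries are exhausted, while a consumer that cannot spill falls back to an infallible grow.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a shared memory pool (NOT DataFusion's `MemoryPool`).
pub struct ToyPool {
    limit: usize,
    used: AtomicUsize,
}

impl ToyPool {
    pub fn new(limit: usize) -> Self {
        Self { limit, used: AtomicUsize::new(0) }
    }

    /// Fallible reservation: fails if the request would exceed the limit.
    pub fn try_grow(&self, bytes: usize) -> Result<(), String> {
        self.used
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
                let new = used.checked_add(bytes)?;
                if new <= self.limit { Some(new) } else { None }
            })
            .map(|_| ())
            .map_err(|used| {
                format!("cannot reserve {} bytes: {} of {} in use", bytes, used, self.limit)
            })
    }

    /// Infallible reservation: may push usage past the limit.
    pub fn grow(&self, bytes: usize) {
        self.used.fetch_add(bytes, Ordering::SeqCst);
    }

    /// Release memory, e.g. after another operator has spilled.
    pub fn shrink(&self, bytes: usize) {
        let _ = self.used.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
            Some(used.saturating_sub(bytes))
        });
    }

    pub fn used(&self) -> usize {
        self.used.load(Ordering::SeqCst)
    }
}

/// Try `try_grow` up to `max_attempts` times, pausing between attempts so that
/// other consumers have a chance to spill and release memory.
/// Assumption: a consumer that cannot spill falls back to `grow` at the end.
pub fn reserve_with_retry(
    pool: &ToyPool,
    bytes: usize,
    max_attempts: usize,
    backoff: Duration,
    can_spill: bool,
) -> Result<(), String> {
    let mut last_err = String::from("no attempts made");
    for attempt in 0..max_attempts {
        if attempt > 0 {
            thread::sleep(backoff);
        }
        match pool.try_grow(bytes) {
            Ok(()) => return Ok(()),
            Err(e) => last_err = e,
        }
    }
    if can_spill {
        // The caller is expected to spill and then try again.
        Err(last_err)
    } else {
        // No spill path available: reserve anyway and rely on others to spill.
        pool.grow(bytes);
        Ok(())
    }
}
```

In this sketch, a non-spilling caller can temporarily push the pool past its limit, which is why the approach depends on other operators releasing memory soon afterward.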
Are these changes tested?
Yes.
Are there any user-facing changes?
No.
Is there any way to test this behavior?
I'm not sure if there's any convenient testing method available in the DataFusion library. My approach is to create 20,000 left-table files (totaling 200GB with 200 million rows) and 20,000 right-table files (totaling 20GB with 200,000 rows), then execute a query like SELECT * FROM t1 JOIN t2 ON [join_condition]. With sort-merge join enabled, the process will throw an error after running for a while.
Maybe you could set the memory limit to a lower threshold to reproduce the issue with a smaller dataset?
Got it, got it. Thanks! By the way, this fix has already been tested with large datasets on my end and it works.