Fix: fix spill oom for sort

Open xxhZs opened this issue 1 month ago • 1 comments

Which issue does this PR close?

  • Closes https://github.com/apache/datafusion/issues/19013

Rationale for this change

When multiple operators attempt to acquire memory from the same memory pool, memory allocation failures may occur due to mutual resource contention.

What changes are included in this PR?

Retry try_grow several times within the sort operator. RowCursorStream cannot spill to disk, so when the memory pool is full its memory is grown directly instead of returning an error. Other operators that support spill-to-disk then spill afterward, assuming some spare memory is still available.
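To make the retry idea concrete, here is a minimal, standard-library-only sketch. It is not the code from this PR and does not use DataFusion's actual memory pool API: `ToyPool`, `try_grow_with_retry`, `max_attempts`, and `backoff` are all hypothetical names chosen for illustration. The only assumption carried over from the description is that a failed `try_grow` may succeed shortly afterward, once another consumer has released memory.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::Duration;

/// Toy bounded memory pool (hypothetical; NOT DataFusion's `MemoryPool`).
pub struct ToyPool {
    limit: usize,
    used: AtomicUsize,
}

impl ToyPool {
    pub fn new(limit: usize) -> Self {
        Self { limit, used: AtomicUsize::new(0) }
    }

    /// Reserve `bytes`, or fail if that would exceed the pool limit.
    pub fn try_grow(&self, bytes: usize) -> Result<(), String> {
        let limit = self.limit;
        self.used
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
                // Returning `None` aborts the update and leaves `used` unchanged.
                used.checked_add(bytes).filter(|total| *total <= limit)
            })
            .map(|_| ())
            .map_err(|used| {
                format!("need {} bytes, but {} of {} are in use", bytes, used, limit)
            })
    }

    /// Release `bytes`, e.g. after another operator has spilled to disk.
    pub fn shrink(&self, bytes: usize) {
        let _ = self.used.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
            Some(used.saturating_sub(bytes))
        });
    }

    /// Bytes currently reserved.
    pub fn used(&self) -> usize {
        self.used.load(Ordering::SeqCst)
    }
}

/// Call `try_grow` up to `max_attempts` times, sleeping `backoff` between
/// attempts so that other consumers get a chance to release memory.
/// Returns the 1-based attempt that succeeded, or the last error.
pub fn try_grow_with_retry(
    pool: &ToyPool,
    bytes: usize,
    max_attempts: usize,
    backoff: Duration,
) -> Result<usize, String> {
    let mut last_err = String::from("max_attempts must be at least 1");
    for attempt in 1..=max_attempts {
        match pool.try_grow(bytes) {
            Ok(()) => return Ok(attempt),
            Err(e) => last_err = e,
        }
        if attempt < max_attempts {
            thread::sleep(backoff);
        }
    }
    Err(last_err)
}
```

In this sketch, a caller that cannot spill would invoke `try_grow_with_retry(&pool, bytes, attempts, backoff)` while another thread sharing the same pool calls `shrink` after spilling. If nobody releases memory within the allotted attempts, the last error is returned unchanged, so the retry only helps when contention is temporary.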

Are these changes tested?

yes

Are there any user-facing changes?

no

xxhZs avatar Dec 02 '25 07:12 xxhZs

Is there any way to test this behavior?

I'm not sure whether the DataFusion library offers a convenient way to test this. My approach was to create 20,000 left-table files (200 GB in total, 200 million rows) and 20,000 right-table files (20 GB in total, 200,000 rows), then run a query like SELECT * FROM t1 JOIN t2 ON [join_condition]. With sort-merge join enabled, the query throws an error after running for a while.

xxhZs avatar Dec 11 '25 06:12 xxhZs

Maybe you could set the memory limit to a lower threshold to reproduce the issue with a smaller dataset?

alamb avatar Dec 11 '25 22:12 alamb

Maybe you could set the memory limit to a lower threshold to reproduce the issue with a smaller dataset?

Got it, got it. Thanks! By the way, this fix has already been tested with large datasets on my end and it works.

xxhZs avatar Dec 12 '25 09:12 xxhZs