datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

chore: Reserve memory for native shuffle writer per partition

Open viirya opened this issue 1 year ago • 5 comments
trafficstars

Which issue does this PR close?

Closes #887.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

viirya avatar Sep 30 '24 20:09 viirya

I've copied the tests on my branch to this PR and the test hangs:

running 6 tests
test execution::datafusion::shuffle_writer::test::test_slot_size ... ok
test execution::datafusion::shuffle_writer::test::test_pmod ... ok
test execution::datafusion::shuffle_writer::test::test_insert_larger_batch ... ok
test execution::datafusion::shuffle_writer::test::test_insert_smaller_batch ... ok
test execution::datafusion::shuffle_writer::test::test_large_number_of_partitions has been running for over 60 seconds
test execution::datafusion::shuffle_writer::test::test_large_number_of_partitions_spilling has been running for over 60 seconds
^C

It is possibly caused by deadlocking on buffered_partitions.lock() when spilling is triggered.

Kontinuation avatar Oct 01 '24 06:10 Kontinuation

Thanks. I knew the cause of the deadlocks. I'm going to revamp some codes.

viirya avatar Oct 01 '24 23:10 viirya

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 33.97%. Comparing base (c3023c5) to head (e678cb0). Report is 25 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #988      +/-   ##
============================================
- Coverage     34.03%   33.97%   -0.07%     
+ Complexity      875      857      -18     
============================================
  Files           112      112              
  Lines         43289    43426     +137     
  Branches       9572     9622      +50     
============================================
+ Hits          14734    14752      +18     
- Misses        25521    25630     +109     
- Partials       3034     3044      +10     
Flag Coverage Δ
?

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Oct 09 '24 18:10 codecov-commenter

Hmm, these tests for large partition number shuffle fail on MacOS runners only. And no stack trace...But I cannot reproduce it locally.

viirya avatar Oct 09 '24 21:10 viirya

Okay, it is the error I expected before:

ret: Err(ArrowError(ExternalError(IoError(Custom { kind: Uncategorized, error: PathError { path: "/var/folders/t_/mmhnh941511_hp2lwh383bp00000gn/T/.tmpQv8o2b/.tmpioYozN", err: Os { code: 24, kind: Uncategorized, message: "Too many open files" } } })), None))

But I had increase it by ulimit. It doesn't help.

viirya avatar Oct 09 '24 21:10 viirya

I'm testing this PR out now, in conjunction with some other PRs because I currently have a reproducible deadlock caused by memory pool issues, as far as I can tell.

andygrove avatar Oct 11 '24 18:10 andygrove

Thanks @andygrove @Kontinuation

viirya avatar Oct 14 '24 15:10 viirya