distributed
distributed copied to clipboard
Fix test_restarting_does_not_deadlock
I was looking into some of the shuffle related test failures.
I was looking into test_restarting_does_not_deadlock and noticed that the test timed out at wait_until_worker_has_tasks which appeared to be a little suspicious, i.e. one of the workers just never received any tasks (might of course be a fundamental scheduling problem, idk yet)
anyhow, I grew suspicious of our "while not condition sleep" patterns and started with migrating this to an event based utility.
cc @hendrikmakait you might be interested in this
The deadlock of the test happens with the following logs
2024-08-30 18:20:53,413 - distributed.shuffle._scheduler_plugin - WARNING - Shuffle 4b151ea2e7118ca361cbf67e2e3cbf08 initialized by task ('shuffle-transfer-4b151ea2e7118ca361cbf67e2e3cbf08', 2) executed on worker tcp://127.0.0.1:64362
2024-08-30 18:20:54,299 - distributed.shuffle._scheduler_plugin - WARNING - Shuffle 4b151ea2e7118ca361cbf67e2e3cbf08 deactivated due to stimulus 'task-finished-1725034854.297965'
so the shuffle is being deactivated prematurely because of a barrier task being finsihed / release (shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd)
The transition log is huge at this point...
transition_log.txt.zip
I think the shuffle actually finishes just way too quickly...
Transition(key=('p2pshuffle-3c0e70f475016b3312ed2e08b06af4fd', 0), start='processing', finish='memory', recommendations={('getitem-f4ac494203a9f98493521efbbad2b8ea', 0): 'processing', 'shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd': 'released'}, stimulus_id='task-finished-1725035052.8782039', timestamp=1725035052.881619), Transition(key='shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd', start='memory', finish='released', recommendations={}, stimulus_id='task-finished-1725035052.8782039', timestamp=1725035052.88164)])
I replaced the nanny with a simple Worker which is much faster since it doesn't require us to start a process. This should be sufficient
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
25 files ±0 25 suites ±0 10h 13m 41s :stopwatch: + 4m 41s 4 125 tests ±0 4 011 :white_check_mark: +2 111 :zzz: ±0 3 :x: - 2 47 651 runs ±0 45 543 :white_check_mark: +3 2 105 :zzz: - 1 3 :x: - 2
For more details on these failures, see this check.
Results for commit 5fc3d74f. ± Comparison against base commit 7c2134e2.
:recycle: This comment has been updated with latest results.
My above analysis was wrong. The test logic is flawed. It prepares all shuffle input data on worker A but expected transfer tasks to be scheduled on worker B. That's not necessarily guaranteed and explains why we're not hitting this condition every now and then.
I slightly rewrote the test. I also confirmed that this test triggers the deadlock condition of https://github.com/dask/distributed/issues/8088 which was the original issue this test was introduced for