distributed Fix test_restarting_does_not

I was looking into some of the shuffle related test failures.

I was looking into test_restarting_does_not_deadlock and noticed that the test timed out at wait_until_worker_has_tasks which appeared to be a little suspicious, i.e. one of the workers just never received any tasks (might of course be a fundamental scheduling problem, idk yet)

anyhow, I grew suspicious of our "while not condition sleep" patterns and started with migrating this to an event based utility.

cc @hendrikmakait you might be interested in this

Aug 30 '24 16:08 fjetter

The deadlock of the test happens with the following logs

2024-08-30 18:20:53,413 - distributed.shuffle._scheduler_plugin - WARNING - Shuffle 4b151ea2e7118ca361cbf67e2e3cbf08 initialized by task ('shuffle-transfer-4b151ea2e7118ca361cbf67e2e3cbf08', 2) executed on worker tcp://127.0.0.1:64362
2024-08-30 18:20:54,299 - distributed.shuffle._scheduler_plugin - WARNING - Shuffle 4b151ea2e7118ca361cbf67e2e3cbf08 deactivated due to stimulus 'task-finished-1725034854.297965'

so the shuffle is being deactivated prematurely because of a barrier task being finsihed / release (shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd) The transition log is huge at this point... transition_log.txt.zip

Aug 30 '24 16:08 fjetter

I think the shuffle actually finishes just way too quickly...

Transition(key=('p2pshuffle-3c0e70f475016b3312ed2e08b06af4fd', 0), start='processing', finish='memory', recommendations={('getitem-f4ac494203a9f98493521efbbad2b8ea', 0): 'processing', 'shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd': 'released'}, stimulus_id='task-finished-1725035052.8782039', timestamp=1725035052.881619), Transition(key='shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd', start='memory', finish='released', recommendations={}, stimulus_id='task-finished-1725035052.8782039', timestamp=1725035052.88164)])

Aug 30 '24 16:08 fjetter

I replaced the nanny with a simple Worker which is much faster since it doesn't require us to start a process. This should be sufficient

Aug 30 '24 16:08 fjetter

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

25 files ±0 25 suites ±0 10h 13m 41s :stopwatch: + 4m 41s 4 125 tests ±0 4 011 :white_check_mark: +2 111 :zzz: ±0 3 :x: - 2 47 651 runs ±0 45 543 :white_check_mark: +3 2 105 :zzz: - 1 3 :x: - 2

For more details on these failures, see this check.

Results for commit 5fc3d74f. ± Comparison against base commit 7c2134e2.

:recycle: This comment has been updated with latest results.

Aug 30 '24 17:08 github-actions[bot]

My above analysis was wrong. The test logic is flawed. It prepares all shuffle input data on worker A but expected transfer tasks to be scheduled on worker B. That's not necessarily guaranteed and explains why we're not hitting this condition every now and then.

Sep 02 '24 10:09 fjetter

I slightly rewrote the test. I also confirmed that this test triggers the deadlock condition of https://github.com/dask/distributed/issues/8088 which was the original issue this test was introduced for

Sep 02 '24 10:09 fjetter

distributed
distributed copied to clipboard

Fix test_restarting_does_not_deadlock

Unit Test Results

distributed distributed copied to clipboard

Fix test_restarting_does_not_deadlock

Unit Test Results

distributed
distributed copied to clipboard