distributed icon indicating copy to clipboard operation
distributed copied to clipboard

Fix test_restarting_does_not_deadlock

Open fjetter opened this issue 1 year ago • 6 comments

I was looking into some of the shuffle related test failures.

I was looking into test_restarting_does_not_deadlock and noticed that the test timed out at wait_until_worker_has_tasks which appeared to be a little suspicious, i.e. one of the workers just never received any tasks (might of course be a fundamental scheduling problem, idk yet)

anyhow, I grew suspicious of our "while not condition sleep" patterns and started with migrating this to an event based utility.

cc @hendrikmakait you might be interested in this

fjetter avatar Aug 30 '24 16:08 fjetter

The deadlock of the test happens with the following logs

2024-08-30 18:20:53,413 - distributed.shuffle._scheduler_plugin - WARNING - Shuffle 4b151ea2e7118ca361cbf67e2e3cbf08 initialized by task ('shuffle-transfer-4b151ea2e7118ca361cbf67e2e3cbf08', 2) executed on worker tcp://127.0.0.1:64362
2024-08-30 18:20:54,299 - distributed.shuffle._scheduler_plugin - WARNING - Shuffle 4b151ea2e7118ca361cbf67e2e3cbf08 deactivated due to stimulus 'task-finished-1725034854.297965'

so the shuffle is being deactivated prematurely because of a barrier task being finsihed / release (shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd) The transition log is huge at this point... transition_log.txt.zip

fjetter avatar Aug 30 '24 16:08 fjetter

I think the shuffle actually finishes just way too quickly...

Transition(key=('p2pshuffle-3c0e70f475016b3312ed2e08b06af4fd', 0), start='processing', finish='memory', recommendations={('getitem-f4ac494203a9f98493521efbbad2b8ea', 0): 'processing', 'shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd': 'released'}, stimulus_id='task-finished-1725035052.8782039', timestamp=1725035052.881619), Transition(key='shuffle-barrier-3c0e70f475016b3312ed2e08b06af4fd', start='memory', finish='released', recommendations={}, stimulus_id='task-finished-1725035052.8782039', timestamp=1725035052.88164)])

fjetter avatar Aug 30 '24 16:08 fjetter

I replaced the nanny with a simple Worker which is much faster since it doesn't require us to start a process. This should be sufficient

fjetter avatar Aug 30 '24 16:08 fjetter

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    25 files  ±0      25 suites  ±0   10h 13m 41s :stopwatch: + 4m 41s  4 125 tests ±0   4 011 :white_check_mark: +2    111 :zzz: ±0  3 :x:  - 2  47 651 runs  ±0  45 543 :white_check_mark: +3  2 105 :zzz:  - 1  3 :x:  - 2 

For more details on these failures, see this check.

Results for commit 5fc3d74f. ± Comparison against base commit 7c2134e2.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Aug 30 '24 17:08 github-actions[bot]

My above analysis was wrong. The test logic is flawed. It prepares all shuffle input data on worker A but expected transfer tasks to be scheduled on worker B. That's not necessarily guaranteed and explains why we're not hitting this condition every now and then.

fjetter avatar Sep 02 '24 10:09 fjetter

I slightly rewrote the test. I also confirmed that this test triggers the deadlock condition of https://github.com/dask/distributed/issues/8088 which was the original issue this test was introduced for

fjetter avatar Sep 02 '24 10:09 fjetter