Remove resumed state
The resumed state is exceptionally complicated and a very frequent source of problems. This PR removes the 'resumed' state and the TaskState.next attribute.
This PR also addresses the issue of tasks with waiters that transition to an error state, which is typical of, but not exclusive to, the cancelled state (see the linked issues below). Such waiters are now sent back to the scheduler.
- Mutually exclusive with #6699
- Mutually exclusive with #6844
- Closes #6682
- Closes #6689
- Closes #6693
- Closes #6685
- Closes #6705
Design
| Events | Main | This PR |
|---|---|---|
| compute[1] → free-keys | status=cancelled<br>previous=executing<br>On completion: quietly release | (same) |
| fetch[2] → free-keys | status=cancelled<br>previous=flight<br>On completion: quietly release | (same) |
| compute[1] → free-keys → compute[1] | status=executing | (same) |
| fetch[2] → free-keys → fetch[2] | status=flight | (same) |
| compute[1] → free-keys → fetch[2] | status=resumed<br>previous=executing<br>next=fetch<br>On success: add-keys<br>On compute failure: cluster deadlock (#6689)<br>On reschedule: InvalidTransition (#6685) | status=executing<br>previous=None<br>On success: task-finished<br>On compute failure: task-erred[5]; reschedule dependents[6]<br>On reschedule: reschedule task and its dependents[6] |
| fetch[2] → free-keys → compute[1] | status=resumed<br>previous=flight<br>next=waiting<br>On success: task-finished with bogus metrics<br>On peer failure [3]: transition to waiting<br>On (un)pickle failure [4]: task-erred | status=flight<br>previous=None<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing; on the scheduler side, request_who_has reschedules<br>On (un)pickle failure [4]: task-erred |
| fetch[2] | status=flight<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing<br>On (un)pickle failure [4]: cluster deadlock (#6705) | status=flight<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing<br>On (un)pickle failure [4]: task-erred[5] and reschedule dependents[6] |
Notes
[1] ComputeTaskEvent(key=<key>)
[2] ComputeTaskEvent(who_has={<key>: [...]}) or AcquireReplicasEvent(who_has={<key>: [...]})
[3] GatherDepSuccessEvent without the requested key, or GatherDepNetworkFailureEvent
[4] GatherDepFailureEvent, typically caused by a failure to unpickle, or GatherDepSuccessEvent for a task that is larger than 60% of max_memory, is therefore spilled immediately, and fails to pickle
[5] The task-erred messages introduce a new scheduler-side use case, where the scheduler receives a task-erred message for a task that is already in memory. At the moment this use case is a no-op; the intended handling is:
- if ts.retries > 0, decrease it by one and, on the reporting worker only, release the task and all its dependents
- if ts.retries = 0, transition the task to erred, release the task on the workers where it is in memory, and release all dependents everywhere.
[6] Rescheduling waiters implies introducing a new waiting → rescheduled transition.
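To make the compute → free-keys → fetch row of the Design table more concrete, here is a small, self-contained toy model of the two behaviours. This is not distributed's actual WorkerState API; every class, method, and attribute name below is hypothetical and only mirrors the bookkeeping described in the table (status, previous, next, and the messages sent back to the scheduler).

```python
# Toy model of the two designs for the "compute -> free-keys -> fetch" row.
# NOT distributed's real WorkerState; all names here are hypothetical.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyTask:
    key: str
    state: str = "released"
    previous: str | None = None  # set when a task is cancelled
    next: str | None = None      # only used by the main-branch model


class MainModel:
    """Main branch: free-keys on an executing task -> cancelled(executing);
    a later fetch request for the same key -> resumed(next=fetch)."""

    def __init__(self) -> None:
        self.tasks: dict[str, ToyTask] = {}

    def handle_compute(self, key: str) -> None:
        self.tasks[key] = ToyTask(key, state="executing")

    def handle_free_keys(self, key: str) -> None:
        ts = self.tasks[key]
        ts.previous, ts.state = ts.state, "cancelled"

    def handle_fetch(self, key: str) -> None:
        ts = self.tasks[key]
        if ts.state == "cancelled" and ts.previous == "executing":
            ts.state, ts.next = "resumed", "fetch"


class ThisPRModel:
    """This PR: cancelled still exists, but resumed and TaskState.next do not.
    A fetch request for a cancelled(executing) task simply puts it back into a
    plain executing state; a compute failure is then reported with task-erred
    and the dependents are rescheduled (notes [5] and [6])."""

    def __init__(self) -> None:
        self.tasks: dict[str, ToyTask] = {}
        self.scheduler_messages: list[tuple[str, str]] = []

    def handle_compute(self, key: str) -> None:
        self.tasks[key] = ToyTask(key, state="executing")

    def handle_free_keys(self, key: str) -> None:
        ts = self.tasks[key]
        ts.previous, ts.state = ts.state, "cancelled"

    def handle_fetch(self, key: str) -> None:
        ts = self.tasks[key]
        if ts.state == "cancelled" and ts.previous == "executing":
            # No resumed state: the task keeps executing as if nothing happened.
            ts.state, ts.previous = "executing", None

    def handle_compute_failure(self, key: str) -> None:
        # "On compute failure: task-erred[5]; reschedule dependents[6]"
        self.tasks[key].state = "error"
        self.scheduler_messages.append(("task-erred", key))
        self.scheduler_messages.append(("reschedule-dependents", key))


# The two models diverge only on the final event of the sequence:
for model in (MainModel(), ThisPRModel()):
    model.handle_compute("x")
    model.handle_free_keys("x")
    model.handle_fetch("x")
    ts = model.tasks["x"]
    print(type(model).__name__, ts.state, ts.previous, ts.next)
# MainModel   resumed   executing fetch
# ThisPRModel executing None      None
```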
TODO
- The implementation does not fully reflect the above design yet
- Failing tests:
  - FAILED distributed/tests/test_failed_workers.py::test_submit_after_failed_worker_sync
  - FAILED distributed/tests/test_failed_workers.py::test_submit_after_failed_worker_async[False]
  - FAILED distributed/tests/test_worker.py::test_task_flight_compute_oserror - a...
  - The whole of test_cancelled_state.py (I haven't looked at it yet)
- Write more tests for waiters being sent back to the scheduler (a rough sketch of one possible test follows this list)
- Ensure all tickets above meet the Definition of Done (DoD)
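Regarding the "write more tests" item above, the following is only a rough, hypothetical sketch of the kind of test that could be added; the test name and exact scenario are placeholders rather than anything from this PR, and it does not pin down the precise cancelled-with-waiters sequence. It merely exercises a worker failure while a dependent task is waiting on its input, using distributed's existing gen_cluster test harness.

```python
from distributed import wait
from distributed.utils_test import gen_cluster, inc


@gen_cluster(client=True)
async def test_waiter_survives_loss_of_dependency(c, s, a, b):
    # Hypothetical sketch: x is computed on worker a, while y depends on x
    # and is pinned to worker b, so b has to fetch x from a.
    x = c.submit(inc, 1, key="x", workers=[a.address], allow_other_workers=True)
    await wait(x)
    y = c.submit(inc, x, key="y", workers=[b.address], allow_other_workers=True)

    # Kill the worker holding x while y is (or is about to start) waiting for
    # its dependency. The scheduler should recompute x elsewhere and y must
    # still produce the correct result rather than deadlock.
    await a.close()
    assert await y == 3
```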
Regardless of TODOs, this is a gargantuan change which won't go in before @fjetter has come back and has had the time to thoroughly review it.
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0, 15 suites ±0, duration 6h 44m 54s :stopwatch: (-21m 1s)

| | Total | Passed :heavy_check_mark: | Skipped :zzz: | Failed :x: | Errors :fire: |
|---|---|---|---|---|---|
| Tests | 3 044 (+44) | 1 158 (-1 750) | 33 (-56) | 109 (+106) | 1 744 (+1 744) |
| Runs | 22 501 (+255) | 15 729 (-5 463) | 790 (-261) | 771 (+768) | 5 211 (+5 211) |
For more details on these failures and errors, see this check.
Results for commit 2e48ace6. ± Comparison against base commit 1d0701b5.
:recycle: This comment has been updated with latest results.
Not sure how best to add comments to the above table, so I copied the relevant rows below and added a Comment column.
| Events | Main | This PR | Comment |
|---|---|---|---|
| compute[1] → free-keys → fetch[2] | status=resumed<br>previous=executing<br>next=fetch<br>On success: add-keys<br>On compute failure: cluster deadlock (#6689)<br>On reschedule: InvalidTransition (#6685) | status=executing<br>previous=None<br>On success: task-finished<br>On compute failure: task-erred[5]; reschedule dependents[6]<br>On reschedule: reschedule task and its dependents[6] | About compute failure: why would we even send anything in this case? Why would we need to reschedule any dependents? |
| fetch[2] → free-keys → compute[1] | status=resumed<br>previous=flight<br>next=waiting<br>On success: task-finished with bogus metrics<br>On peer failure [3]: transition to waiting<br>On (un)pickle failure [4]: task-erred | status=flight<br>previous=None<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing; on the scheduler side, request_who_has reschedules<br>On (un)pickle failure [4]: task-erred | On peer failure: this is actually the only sane use case of this state; all other combinations are mostly artifacts of race conditions. The fundamental intention of the resumed state is "wait for the fetch to time out and start to compute the task afterwards". A transition to fetch after this seems just wrong, since the most recent scheduler request is clearly compute. Same for the missing state: how can a "computable task" be missing? I think that by transitioning to fetch or missing we'd mix up semantics. I also don't see how the system would actually end up computing the task in the end. Can you elaborate on how this would work out? |
| fetch[2] | status=flight<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing<br>On (un)pickle failure [4]: cluster deadlock (#6705) | status=flight<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing<br>On (un)pickle failure [4]: task-erred[5] and reschedule dependents[6] | I believe this is a known issue: https://github.com/dask/distributed/issues/4439. I don't see how it connects to the resumed state. While not ideal, I do not consider this a major issue that would warrant such a big change. |
I fully acknowledge that resumed is not an intuitive state. However, the table above doesn't feel very intuitive to me either, and I'd like to discuss the intended behavior a bit more thoroughly before we commit to this approach.
I only skimmed the implementation. It doesn't look as bad as I thought, but based on the table I would have expected some changes to the scheduler as well, which makes me a bit nervous.
FWIW, as already outlined above, I believe the most important real-world use case is fetch → free-keys → compute, and we should really get this one right. IIUC this is working on main.
I'll have a closer look at the code and provide more feedback on this proposal. I'm currently a bit skeptical about removing resumed, since I think we need it as a complement to cancelled, and I don't think we should remove cancelled; see https://github.com/dask/distributed/pull/6844#pullrequestreview-1064912014