
Remove resumed state

Open · crusaderky opened this pull request 3 years ago · 3 comments

The resumed state is exceptionally complicated and a very frequent source of problems. This PR removes the 'resumed' state and the TaskState.next attribute.

This PR also deals with tasks that have waiters and transition to error, which is typical of, but not exclusive to, the cancelled state (see the issues below). Such waiters are now sent back to the scheduler.

  • Mutually exclusive with #6699
  • Mutually exclusive with #6844
  • Closes #6682
  • Closes #6689
  • Closes #6693
  • Closes #6685
  • Closes #6705
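
For readers less familiar with this corner of the worker state machine, here is a minimal toy sketch of the bookkeeping being removed. It is not the actual code in distributed/worker_state_machine.py; only the attribute names previous and next and the state names are taken from the real TaskState, everything else is illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTaskState:
    """Toy stand-in for the worker-side TaskState; only the fields relevant to
    the cancelled/resumed bookkeeping are modelled."""
    key: str
    state: str                      # "executing", "flight", "cancelled", "resumed", ...
    previous: Optional[str] = None  # what the task was doing when it was cancelled
    next: Optional[str] = None      # what it should turn into once the old work finishes


def free_keys(ts: ToyTaskState) -> None:
    """The scheduler tells the worker to forget a key that is still executing or
    in flight: instead of being dropped immediately, the task is parked in
    'cancelled' and remembers its previous state."""
    if ts.state in ("executing", "flight"):
        ts.previous, ts.state = ts.state, "cancelled"


def request_again(ts: ToyTaskState, next_state: str) -> None:
    """The scheduler asks for the key again (compute or fetch) while the
    cancelled work is still running: main parks the new intention in `next` and
    keeps waiting -- this is the 'resumed' state, and the TaskState.next
    attribute, that this PR removes."""
    if ts.state == "cancelled":
        ts.state, ts.next = "resumed", next_state
```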

Design

| Events | Main | This PR |
|---|---|---|
| compute[1] → free-keys | status=cancelled, previous=executing<br>On completion: quietly release | (same) |
| fetch[2] → free-keys | status=cancelled, previous=flight<br>On completion: quietly release | (same) |
| compute[1] → free-keys → compute[1] | status=executing | (same) |
| fetch[2] → free-keys → fetch[2] | status=flight | (same) |
| compute[1] → free-keys → fetch[2] | status=resumed, previous=executing, next=fetch<br>On success: add-keys<br>On compute failure: cluster deadlock (#6689)<br>On reschedule: InvalidTransition (#6685) | status=executing, previous=None<br>On success: task-finished<br>On compute failure: task-erred[5]; reschedule dependents[6]<br>On reschedule: reschedule task and its dependents[6] |
| fetch[2] → free-keys → compute[1] | status=resumed, previous=flight, next=waiting<br>On success: task-finished with bogus metrics<br>On peer failure [3]: transition to waiting<br>On (un)pickle failure [4]: task-erred | status=flight, previous=None<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing; on the scheduler side, request_who_has reschedules<br>On (un)pickle failure [4]: task-erred |
| fetch[2] | status=flight<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing<br>On (un)pickle failure [4]: cluster deadlock (#6705) | status=flight<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing<br>On (un)pickle failure [4]: task-erred[5] and reschedule dependents[6] |
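
To make the rows above concrete, here is a small self-contained simulation that replays an event sequence for a single key under both designs. It is a toy that deliberately encodes nothing beyond what the table states; the real transition functions in distributed/worker_state_machine.py are considerably more involved.

```python
def replay(events: list[str], *, design: str) -> dict:
    """Replay a worker-side event sequence for one key, following only the
    rules written in the table above. `design` is "main" or "this PR"."""
    status, previous, next_ = "released", None, None
    for ev in events:
        if ev == "free-keys" and status in ("executing", "flight"):
            # Cancellation: remember what the task was doing.
            previous, status = status, "cancelled"
        elif ev in ("compute", "fetch"):
            wanted = "executing" if ev == "compute" else "flight"
            if status == "released":
                status = wanted
            elif status == "cancelled":
                if design == "main" and wanted != previous:
                    # main: keep waiting for the old work; park the new
                    # intention in `next` (the resumed state).
                    status = "resumed"
                    next_ = "waiting" if ev == "compute" else "fetch"
                else:
                    # this PR (and main when the request matches `previous`):
                    # simply go back to what the worker was already doing.
                    status, previous = previous, None
    return {"status": status, "previous": previous, "next": next_}


# Fifth row of the table: compute -> free-keys -> fetch
print(replay(["compute", "free-keys", "fetch"], design="main"))
# {'status': 'resumed', 'previous': 'executing', 'next': 'fetch'}
print(replay(["compute", "free-keys", "fetch"], design="this PR"))
# {'status': 'executing', 'previous': None, 'next': None}
```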

Notes

[1] ComputeTaskEvent(key=<key>)
[2] ComputeTaskEvent(who_has={<key>: [...]}) or AcquireReplicasEvent(who_has={<key>: [...]})
[3] GatherDepSuccessEvent without the requested key, or GatherDepNetworkFailureEvent
[4] GatherDepFailureEvent, typically caused by a failure to unpickle, or GatherDepSuccessEvent for a task that is larger than 60% of max_memory, is therefore spilled immediately, and fails to pickle
[5] The task-erred messages introduce a new scheduler-side use case: the scheduler receives a task-erred message for a task that is already in memory. At the moment this use case is a no-op. The proposed handling (a sketch follows these notes):

  • if ts.retries > 0, decrease it by one and, on the reporting worker only, release the task and all its dependents
  • if ts.retries = 0, transition the task to erred, release the task on the workers where it is in memory, and release all dependents everywhere.

[6] Rescheduling waiters implies introducing a new waiting->rescheduled transition.
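
A rough sketch of the scheduler-side handling described in note [5]. The handler name and the release_keys / set_erred helpers are hypothetical stand-ins, not existing Scheduler methods; a real implementation would go through the scheduler's transition machinery.

```python
def handle_task_erred_in_memory(scheduler, ts, reporting_worker: str) -> None:
    """Hypothetical sketch: a worker reports task-erred for a task that the
    scheduler already considers in memory (today this is a no-op)."""
    dependents = [dts.key for dts in ts.dependents]
    if ts.retries > 0:
        ts.retries -= 1
        # Forget the task and all of its dependents on the reporting worker
        # only; replicas on other workers stay in memory.
        scheduler.release_keys(  # hypothetical helper
            keys=[ts.key, *dependents], workers=[reporting_worker]
        )
    else:
        # Out of retries: the task becomes erred, its in-memory replicas are
        # released, and its dependents are released everywhere.
        scheduler.set_erred(ts.key)                                      # hypothetical
        scheduler.release_keys(keys=[ts.key], workers=list(ts.who_has))  # hypothetical
        scheduler.release_keys(keys=dependents, workers=None)            # everywhere
```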

TODO

  • The implementation does not fully reflect the above design yet
  • Failing tests:
    • FAILED distributed/tests/test_failed_workers.py::test_submit_after_failed_worker_sync
    • FAILED distributed/tests/test_failed_workers.py::test_submit_after_failed_worker_async[False]
    • FAILED distributed/tests/test_worker.py::test_task_flight_compute_oserror - a...
    • All of test_cancelled_state.py (not investigated yet)
  • Write more tests for waiters being sent back to the scheduler
  • Ensure all the tickets above meet their Definition of Done (DoD)

Regardless of TODOs, this is a gargantuan change which won't go in before @fjetter has come back and has had the time to thoroughly review it.

crusaderky · Jul 12 '22 12:07

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0    15 suites ±0    6h 44m 54s :stopwatch: -21m 1s
3 044 tests +44    1 158 :heavy_check_mark: -1 750    33 :zzz: -56    109 :x: +106    1 744 :fire: +1 744
22 501 runs +255    15 729 :heavy_check_mark: -5 463    790 :zzz: -261    771 :x: +768    5 211 :fire: +5 211

For more details on these failures and errors, see this check.

Results for commit 2e48ace6. ± Comparison against base commit 1d0701b5.

:recycle: This comment has been updated with latest results.

github-actions[bot] · Jul 12 '22 13:07

Not sure how best to add comments to the above table, so I repeated the relevant rows below with an extra Comment column.

| Events | Main | This PR | Comment |
|---|---|---|---|
| compute[1] → free-keys → fetch[2] | status=resumed, previous=executing, next=fetch<br>On success: add-keys<br>On compute failure: cluster deadlock (#6689)<br>On reschedule: InvalidTransition (#6685) | status=executing, previous=None<br>On success: task-finished<br>On compute failure: task-erred[5]; reschedule dependents[6]<br>On reschedule: reschedule task and its dependents[6] | About compute failure: why would we even send anything in this case? Why would we need to reschedule any dependents? |
| fetch[2] → free-keys → compute[1] | status=resumed, previous=flight, next=waiting<br>On success: task-finished with bogus metrics<br>On peer failure [3]: transition to waiting<br>On (un)pickle failure [4]: task-erred | status=flight, previous=None<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing; on the scheduler side, request_who_has reschedules<br>On (un)pickle failure [4]: task-erred | About peer failure: this is actually the only sane use case of this state. All other combinations are mostly artifacts of race conditions. The fundamental intention of the resumed state is "wait for the fetch to time out, then compute the task". A transition to fetch after this seems wrong, since the most recent scheduler request is clearly compute. The same goes for the missing state: how can a "computable" task be missing? I think by transitioning to fetch or missing we would mix up semantics. I also don't see how the system would actually end up computing the task in the end. Can you elaborate on how this would work out? |
| fetch[2] | status=flight<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing<br>On (un)pickle failure [4]: cluster deadlock (#6705) | status=flight<br>On success: add-keys<br>On peer failure [3]: transition to fetch or missing<br>On (un)pickle failure [4]: task-erred[5] and reschedule dependents[6] | I believe this is a known issue (https://github.com/dask/distributed/issues/4439) and I don't see how it connects to the resumed state. While not ideal, I do not consider it a major issue that would warrant such a big change. |

I fully acknowledge that resumed is not an intuitive state. However, the table above also doesn't feel very intuitive to me, and I'd like to discuss the intended behavior a bit more thoroughly before we commit to this approach.

I only skimmed the implementation. It doesn't look as bad as I thought, but based on the table I would've expected some changes to the scheduler as well, which makes me a bit nervous.

FWIW, as already outlined above, I believe the most important real-world use case is fetch -> free-keys -> compute, and we should really get this one right. IIUC this is working on main.

fjetter · Aug 04 '22 16:08

I'll have a closer look at the code and will provide more feedback on this proposal. I'm currently a bit skeptical about removing resumed, since I think we need it as a complement to cancelled, and I don't think we should remove cancelled; see https://github.com/dask/distributed/pull/6844#pullrequestreview-1064912014

fjetter · Aug 08 '22 10:08