distributed icon indicating copy to clipboard operation
distributed copied to clipboard

Refactor restart() and restart_workers()

Open crusaderky opened this issue 1 year ago • 2 comments

Refactor Client.restart() and Client.restart_workers() to use the same machinery for worker restart.

This is propaedeutic to #8537, where the scheduler will need to call restart_workers() internally.

  • Blocks #8538

Changed behaviour

  • restart() will no longer return self. It's weird and it was never advertised in the documentation anyway.
  • in restart_workers(), the scheduler will now unilaterally remove workers where the nanny failed to terminate the worker within timeout seconds. This aligns its behaviour to restart().
  • default timeout for restart_workers() has changed from infinity to distributed.comm.timeouts.connect * 4 (2 minutes by default). This aligns its behaviour to restart(). The method no longer accepts timeout=None.
  • restart_workers() no longer forwards the exception to the client if the worker fails to restart. While this is, strictly speaking, a loss of functionality, IMHO I can think of very few of use cases when this can happen, none of which sound particularly interesting to me. Note that this is specifically a use case where the nanny is healthy and in the middle of an RPC call with the scheduler, while the worker is so busted that it fails to restart within the timeout. The only semi-realistic use case I can think of is a WorkerPlugin that fails to acquire an external resource (e.g. connect to a database) on startup.

crusaderky avatar Mar 01 '24 15:03 crusaderky

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    29 files  ± 0      29 suites  ±0   11h 28m 31s :stopwatch: + 16m 19s  4 056 tests + 1   3 934 :white_check_mark:  -  1    109 :zzz: ±0  13 :x: +2  54 902 runs  +32  52 408 :white_check_mark: +36  2 409 :zzz:  - 6  85 :x: +2 

For more details on these failures, see this check.

Results for commit 3d3140f4. ± Comparison against base commit 8927bfd0.

This pull request removes 4 and adds 5 tests. Note that renamed tests count towards both.
distributed.tests.test_client ‑ test_restart_workers_exception[False]
distributed.tests.test_client ‑ test_restart_workers_exception[True]
distributed.tests.test_client ‑ test_restart_workers_timeout[False]
distributed.tests.test_client ‑ test_restart_workers_timeout[True]
distributed.tests.test_client ‑ test_restart_workers_exception
distributed.tests.test_client ‑ test_restart_workers_kill_timeout[False]
distributed.tests.test_client ‑ test_restart_workers_kill_timeout[True]
distributed.tests.test_client ‑ test_restart_workers_restart_timeout[False]
distributed.tests.test_client ‑ test_restart_workers_restart_timeout[True]

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Mar 01 '24 16:03 github-actions[bot]

All test failures are unrelated

crusaderky avatar Mar 04 '24 19:03 crusaderky

restart_workers() no longer forwards the exception to the client if the worker fails to restart. While this is, strictly speaking, a loss of functionality, IMHO I can think of very few of use cases when this can happen, none of which sound particularly interesting to me. Note that this is specifically a use case where the nanny is healthy and in the middle of an RPC call with the scheduler, while the worker is so busted that it fails to restart within the timeout. The only semi-realistic use case I can think of is a WorkerPlugin that fails to acquire an external resource (e.g. connect to a database) on startup.

I suppose another realistic scenario would be a task running on the worker that repeatedly blocks the GIL for an extended period of time? Granted, there's probably not much added value to a simple timed out message in that case.

hendrikmakait avatar Mar 20 '24 10:03 hendrikmakait