Eliminate partially-removed-worker state on scheduler (comms open, state removed)
`Scheduler.remove_worker` removes the scheduler's state for the worker (`self.workers[addr]`, `self.stream_comms[addr]`, etc.), but does not close the actual network connections to the worker. This is even codified in the `close=False` option, which supports removing the worker's state without telling the worker to shut down or disconnect.
Keeping the network connections open (and still listening on them) leaves the worker in an essentially half-removed state: the scheduler no longer knows about the worker, but if the worker sends updates over the open connection, the scheduler will respond to them, potentially invoking handlers that assume the worker state is still there.
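To make the hazard concrete, here is a minimal sketch of the failure mode (signatures abbreviated; the exact arguments of `remove_worker` vary between versions):

```python
async def demo_half_removed(scheduler, addr):
    # Remove scheduler-side state, but leave the network connection open
    await scheduler.remove_worker(address=addr, close=False)
    assert addr not in scheduler.workers       # the scheduler forgot the worker
    assert addr not in scheduler.stream_comms  # ...and its batched comm

    # Yet the Scheduler.handle_worker coroutine is still listening on the
    # open connection. A late message from the worker dispatches to stream
    # handlers that look up self.workers[worker] and hit a KeyError, or
    # worse, silently act on stale state.
```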
There are two things to figure out:

- What does it mean for a worker to be "there" or "not there", from the scheduler's perspective?
  - i.e. is it only that `self.workers[addr]` exists? Or also `self.stream_comms[addr]` and other such fields? Is there a `self.handle_worker` coroutine running for that worker too?
  - Can there be a single point of truth for this? A single dict to check, or a method to call? (One possible shape is sketched after this list.)
- How can `Scheduler.remove_worker` ensure that:
  - after it returns, the worker is fully "not there";
  - if it yields control while it's running (via `await`), things are in a well-defined state (the worker is either "there" or "not there", or maybe even in a "closing" state, but never in the half-removed state we have currently);
  - if multiple `remove_worker` coroutines run concurrently, everything remains consistent;
  - if multiple `remove_worker` coroutines run concurrently, the second one does not return until the worker is actually removed (i.e. the first coroutine has completed)?
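One possible shape, as a hedged sketch rather than a concrete proposal: keep an explicit per-worker lifecycle dict as the single point of truth, and serialize `remove_worker` calls through a per-worker lock. All names here (`WorkerLifecycle`, `_close_comms`, `_drop_state`) are hypothetical:

```python
import asyncio
from enum import Enum, auto

class WorkerLifecycle(Enum):
    THERE = auto()
    CLOSING = auto()
    GONE = auto()

class SchedulerSketch:
    def __init__(self) -> None:
        self._lifecycle: dict[str, WorkerLifecycle] = {}
        self._remove_locks: dict[str, asyncio.Lock] = {}

    def worker_is_there(self, addr: str) -> bool:
        # Single point of truth: self.workers, self.stream_comms and the
        # handle_worker coroutine must all agree with this dict.
        return self._lifecycle.get(addr) is WorkerLifecycle.THERE

    async def remove_worker(self, addr: str) -> None:
        lock = self._remove_locks.setdefault(addr, asyncio.Lock())
        async with lock:
            # Concurrent callers serialize here: a second call blocks until
            # the first has fully removed the worker, then returns.
            if self._lifecycle.get(addr) is not WorkerLifecycle.THERE:
                return
            self._lifecycle[addr] = WorkerLifecycle.CLOSING
            await self._close_comms(addr)  # may yield: state is "closing"
            self._drop_state(addr)         # synchronous: no partial state is observable
            self._lifecycle[addr] = WorkerLifecycle.GONE

    async def _close_comms(self, addr: str) -> None:
        """Close stream comms and stop the handle_worker coroutine."""

    def _drop_state(self, addr: str) -> None:
        """del self.workers[addr], self.stream_comms[addr], etc."""
```

The key invariant is that the only `await` happens while the lifecycle dict says "closing", so any coroutine that checks `worker_is_there` observes a well-defined state rather than the current half-removed one.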
Addresses https://github.com/dask/distributed/issues/6354
Is there any ETA for this?
There's an extra layer of complexity added to this when `Scheduler.retire_workers` and its parameter flags come into play:

- `close_workers=False, remove=False`: the worker sits forever in `status=closing_gracefully`.
- `close_workers=True, remove=False`: calls `Scheduler.close_worker()`, which kindly asks the worker to shut itself down. This API makes no sense to me.
- `close_workers=False, remove=True`: calls `Scheduler.remove_worker(close=False)` as described above. This is the default behaviour of `Scheduler.retire_workers()`.
- `close_workers=True, remove=True`: shuts the workers and their nannies down and removes them. This is the default behaviour of `Client.retire_workers()`, and it differs from the scheduler-side default (contrasted in the sketch after this list).
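For contrast, the two defaults side by side. This is a sketch based on the matrix above, with signatures abbreviated:

```python
async def compare_defaults(scheduler, client, addr):
    # Scheduler-side default (close_workers=False, remove=True):
    # data is migrated off and state removed via
    # Scheduler.remove_worker(close=False), leaving the half-removed
    # hazard described at the top of this issue.
    await scheduler.retire_workers(workers=[addr])

    # Client-side default (close_workers=True, remove=True):
    # the workers and their nannies are shut down and removed.
    client.retire_workers([addr])
```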
I suspect the below is purely hypothetical, but I'll note it nonetheless.
At the moment, there is no simple API for gracefully restarting a worker. For example, it would be useful for cleaning up a memory leak on a worker without losing the data on it. Currently you can do:
```python
client.retire_workers([addr], close_workers=False, remove=False)
client.restart_workers([addr])
```
but with the removal of the flags from `retire_workers`, this workaround would become impossible.