Retire workers in SpecCluster by name
Currently there's a mismatch between how workers are stored on the scheduler and SpecCluster. Scheduler.workers uses worker addresses for keys, while SpecCluster.workers uses the output of SpecCluster._new_worker_name for keys (which is often an integer that's incremented when a new worker is added). This mismatch results in workers not being retired properly here (xref https://github.com/dask/distributed/issues/4069):
https://github.com/dask/distributed/blob/586ded30a8f15b91bd4e45e50c832ef21faea79c/distributed/deploy/spec.py#L328
This PR updates how we call retire_workers to use worker names instead of addresses, which ensures the Scheduler and SpecCluster are on the same page about which workers should be retired.
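To illustrate the intent (a minimal sketch of the idea, not the exact diff), the retire call in SpecCluster._correct_state_internal can pass the SpecCluster-side names through the names= keyword of Scheduler.retire_workers rather than workers=, which expects addresses:

```python
# Sketch only; `to_close` comes from SpecCluster._correct_state_internal and holds
# SpecCluster worker names (the keys of self.workers), not addresses.
to_close = set(self.workers) - set(self.worker_spec)
if to_close:
    # Before: retire_workers(workers=list(to_close)) -- the scheduler looks these up
    # as addresses, finds nothing, and the workers are never retired.
    # After: retire_workers(names=...) matches on the scheduler-side worker names.
    await self.scheduler_comm.retire_workers(names=list(to_close))
```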
Note that since the key stored in SpecCluster.workers isn't always used as the corresponding worker's name:
https://github.com/dask/distributed/blob/586ded30a8f15b91bd4e45e50c832ef21faea79c/distributed/deploy/spec.py#L343-L345
I added a new SpecCluster._spec_name_to_worker_name mapping from the name stored in SpecCluster.workers to the actual name used for the worker that's created.
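Roughly, the bookkeeping looks like this (a sketch, assuming a worker spec's "options" may override the name passed to the worker class; the attribute and variable names mirror the ones described above and are illustrative, not the exact PR code):

```python
# When a worker is created from a spec entry, record the name actually used,
# since spec["options"] may override the default name derived from the spec key.
opts = dict(spec.get("options", {}))
worker_name = opts.get("name", spec_name)
self._spec_name_to_worker_name[spec_name] = worker_name

# Later, when retiring, translate SpecCluster keys to names the scheduler knows about.
names = [self._spec_name_to_worker_name.get(n, n) for n in to_close]
await self.scheduler_comm.retire_workers(names=names)
```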
Closes https://github.com/dask/distributed/issues/4069
cc @bbockelm
@jacobtomlinson if you have a moment, would you be able to look this over?
Note that since the key stored in SpecCluster.workers isn't always used as the corresponding worker's name:
We could also make this a requirement. I don't recall why this distinction was made, but it might not be strictly required, and this sounds like the kind of thing that would make things more consistent.
Also, a corner case to watch out for with SpecCluster is multi-worker jobs (grep for MultiWorker for a test). This comes up with dask-cuda and with dask-jobqueue.
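For concreteness, a single spec entry can expand into several scheduler-side workers via the "group" key; something like the following (a sketch with illustrative spec values, not taken from the PR):

```python
from distributed import Worker

# One SpecCluster key ("gpu-job" here) describes two tightly coupled workers.
# With the "group" convention, the scheduler-side worker names are the spec key
# plus each suffix, i.e. "gpu-job-0" and "gpu-job-1".
worker_spec = {
    "gpu-job": {
        "cls": Worker,              # in practice a MultiWorker-style class
        "options": {"nthreads": 1},
        "group": ["-0", "-1"],
    }
}
```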
This looks reasonable to me. But I want to echo what Matt said about multiple workers. A worker object, with respect to SpecCluster, may reference multiple tightly coupled workers, like in dask-cuda.
When you run dask-cuda-worker, it inspects the number of GPUs and spawns one worker per GPU. So it may not always be possible to reconcile a 1-to-1 relationship.
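So retiring by name has to account for that expansion; a sketch of the kind of mapping needed (a hypothetical helper, assuming the "group" suffix convention shown above) to collapse scheduler-side worker names back to SpecCluster spec keys:

```python
# Map scheduler-side worker names back to the SpecCluster spec keys that own them.
def spec_keys_for(worker_names, worker_spec):
    mapping = {}
    for key, spec in worker_spec.items():
        for suffix in spec.get("group", [""]):
            mapping[str(key) + suffix] = key
    return {mapping.get(name, name) for name in worker_names}

# e.g. spec_keys_for(["gpu-job-0", "gpu-job-1"], worker_spec) -> {"gpu-job"}
```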
Hi, I'm interested in fixing this too. I updated the patch and added support for MultiWorkers (https://github.com/dask/distributed/pull/6065). Locally, the only failing tests seem unrelated; at this point, some guidance on how to fix them would be great. Maybe I should base the patch on a stable version?