
Retire workers in SpecCluster by name

Open jrbourbeau opened this issue 4 years ago • 5 comments

Currently there's a mismatch between how workers are stored on the scheduler and spec cluster. Scheduler.workers uses worker addresses for keys while SpecCluster.workers uses the output of SpecCluster._new_worker_name for keys (which is often an integer that's incremented when a new worker is added). This mismatch results in workers not being retired properly here (xref https://github.com/dask/distributed/issues/4069):

https://github.com/dask/distributed/blob/586ded30a8f15b91bd4e45e50c832ef21faea79c/distributed/deploy/spec.py#L328

This PR updates how we call retire_workers to use worker names instead of addresses, which will ensure the Scheduler and SpecCluster are on the same page about which workers should be retired.
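To illustrate the mismatch being fixed, here is a minimal sketch using plain dicts (not the actual distributed API; the names `scheduler_workers`, `cluster_workers`, and `retire_by_name` are hypothetical): the scheduler keys workers by address while the cluster keys them by name, so retiring by name is the only key both sides can agree on.

```python
# Hypothetical sketch of the key mismatch between the two views.
scheduler_workers = {  # address -> worker name (scheduler's view)
    "tcp://127.0.0.1:40001": "0",
    "tcp://127.0.0.1:40002": "1",
}
cluster_workers = {"0": "<worker object 0>", "1": "<worker object 1>"}  # name-keyed


def retire_by_name(names):
    """Retire workers by name so both views stay consistent.

    Returns the addresses that were retired on the scheduler side.
    """
    addrs = [addr for addr, name in scheduler_workers.items() if name in names]
    for addr in addrs:
        del scheduler_workers[addr]
    for name in names:
        cluster_workers.pop(name, None)
    return addrs
```

Passing addresses (the scheduler's keys) to a dict keyed by names would simply miss, which is why workers were not being retired properly before this change.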

Note that since the key stored in SpecCluster.workers isn't always used as the corresponding worker's name:

https://github.com/dask/distributed/blob/586ded30a8f15b91bd4e45e50c832ef21faea79c/distributed/deploy/spec.py#L343-L345

I added a new SpecCluster._spec_name_to_worker_name mapping which maps between the name stored in SpecCluster.workers and the actual name used for the worker that's created.
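Roughly, the mapping works like this (a sketch, not the actual implementation; the spec contents here are invented): if a worker spec supplies an explicit `name` option, the created worker uses that name rather than the key in SpecCluster.workers, so a translation table is needed.

```python
# Hypothetical worker specs: the dict keys are the SpecCluster.workers keys,
# while an optional "name" in options overrides the actual worker name.
worker_spec = {
    "gpu-0": {"options": {"name": "cuda-worker-0"}},
    "cpu-0": {"options": {}},
}

# Build the spec-key -> actual-worker-name translation.
spec_name_to_worker_name = {}
for spec_name, spec in worker_spec.items():
    opts = spec.get("options", {})
    # Fall back to the spec key when no explicit name is given.
    spec_name_to_worker_name[spec_name] = opts.get("name", spec_name)
```

With this table, retiring by the keys stored in SpecCluster.workers can be translated into the names the scheduler actually knows about.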

Closes https://github.com/dask/distributed/issues/4069

jrbourbeau avatar Aug 25 '20 19:08 jrbourbeau

cc @bbockelm

jrbourbeau avatar Aug 25 '20 19:08 jrbourbeau

@jacobtomlinson if you have a moment, would you be able to look this over?

quasiben avatar Aug 26 '20 13:08 quasiben

> Note that since the key stored in SpecCluster.workers isn't always used as the corresponding worker's name:

We could also make this a requirement. I don't recall why this distinction was made, but it might not be strictly required, and this sounds like the kind of thing that would make things more consistent.

Also, a corner case to watch out for with SpecCluster is multi-worker jobs (grep for MultiWorker for a test). This comes up with dask-cuda and with dask-jobqueue.

mrocklin avatar Aug 26 '20 14:08 mrocklin

This looks reasonable to me. But I want to echo what Matt said about multiple workers. A worker object with respect to SpecCluster may reference multiple tightly coupled workers, like in dask-cuda.

When you run dask-cuda-worker it inspects the number of GPUs and spawns one worker per GPU. So it may not always be possible to maintain a 1-to-1 relationship.
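In that one-to-many case, a single SpecCluster entry can stand for several scheduler-side workers, so any name mapping has to produce a list rather than a single name. A sketch (the helper name and naming scheme are assumptions, not dask-cuda's actual convention):

```python
def worker_names_for(spec_name, n_gpus):
    """Hypothetical: expand one spec entry into per-GPU worker names.

    One dask-cuda-worker process spawns one worker per GPU, so a single
    SpecCluster key may correspond to several scheduler-side names.
    """
    return [f"{spec_name}-{i}" for i in range(n_gpus)]
```

Retiring the spec entry would then mean retiring every name in the expanded list, not just one.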

jacobtomlinson avatar Aug 26 '20 14:08 jacobtomlinson

Hi, I'm interested in fixing this too. I updated the change and added support for MultiWorkers (https://github.com/dask/distributed/pull/6065). Locally, the only failing tests seem unrelated; at this point, some guidance on how to fix them would be great. Maybe using a stable version to base the patch on?

LunarLanding avatar Apr 05 '22 16:04 LunarLanding