Distributed's Adaptive's Scheduler `retire_workers` and Cluster `scale_down`

Open jakirkham opened this issue 7 years ago • 9 comments

It appears that Distributed's Adaptive tells the Scheduler to retire_workers and then scale_down the Cluster (https://github.com/dask/distributed/blob/1.21.7/distributed/deploy/adaptive.py#L230-L235). Is there a reason both of these operations are needed? Also, what does this mean for downstream libraries inheriting from Distributed's Adaptive or Distributed's Cluster?

xref: https://github.com/dask/dask-drmaa/issues/65 (for context)

jakirkham avatar May 02 '18 19:05 jakirkham

Scale down is used here as a method of last resort to kill jobs after they should already have been closed down smoothly. We don't have to do this though. Retiring alone is probably sufficient.

mrocklin avatar May 02 '18 20:05 mrocklin

I had understood the intended semantics to be different: retire_workers is for the Scheduler to tell Worker/Nanny processes to terminate; scale_down is at the cluster level, to tell a machine to shut down. Is this not the intention?

rbubley avatar May 04 '18 10:05 rbubley

Yes, retire_workers tells the worker and nanny processes to terminate. Typically this results in the cluster-controlled job also being marked as finished at the resource manager level (Kubernetes, YARN, SGE, ...).

scale_down tells the resource manager (Kubernetes, YARN, SGE, ...) to terminate the job.

Nothing is able to tell the machine itself to shut down (I'm not sure if this was your intention). This typically isn't available to anyone except administrators at the physical location.
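A minimal sketch of the ordering described above, using toy stand-ins rather than the real distributed classes. `ToyScheduler`, `ToyResourceManager`, `job_finished`, and `retire_then_scale_down` are hypothetical names made up for illustration; only the method names `retire_workers` and `scale_down` come from this thread. It assumes a worker that retires gracefully also ends its job, so `scale_down` only has to kill whatever is left.

```python
class ToyScheduler:
    """Hypothetical stand-in for the scheduler (not the real distributed API)."""

    def __init__(self, workers, unresponsive=()):
        self.workers = set(workers)
        # Workers that ignore the graceful shutdown request.
        self.unresponsive = set(unresponsive)

    def retire_workers(self, workers):
        # Graceful path: ask the worker/nanny processes to terminate.
        closed = [w for w in workers
                  if w in self.workers and w not in self.unresponsive]
        self.workers -= set(closed)
        return closed


class ToyResourceManager:
    """Hypothetical stand-in for Kubernetes, YARN, SGE, ..."""

    def __init__(self, jobs):
        self.running = set(jobs)
        self.killed = []

    def job_finished(self, worker):
        # The worker process exited on its own, so its job ends too.
        self.running.discard(worker)

    def scale_down(self, workers):
        # Last resort: kill only the jobs that are still running.
        for w in workers:
            if w in self.running:
                self.running.discard(w)
                self.killed.append(w)


def retire_then_scale_down(scheduler, manager, workers):
    closed = scheduler.retire_workers(workers)  # smooth shutdown first
    for w in closed:
        manager.job_finished(w)
    manager.scale_down(workers)                 # then clean up stragglers
    return closed


scheduler = ToyScheduler(["a", "b", "c"], unresponsive=["b"])
manager = ToyResourceManager(["a", "b", "c"])
closed = retire_then_scale_down(scheduler, manager, ["a", "b"])
```

In this toy run, worker "a" retires smoothly and only the unresponsive worker "b" is killed by `scale_down`. For a downstream cluster class this suggests that `scale_down` should tolerate being handed workers whose jobs have already finished.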

mrocklin avatar May 04 '18 10:05 mrocklin

This might be a silly question, but how do these compare to stop_worker/stop_workers?

jakirkham avatar May 04 '18 14:05 jakirkham

stop_worker and stop_workers were older terms. We might consider deprecating them.

mrocklin avatar May 04 '18 15:05 mrocklin

SGTM

jakirkham avatar May 04 '18 15:05 jakirkham

How does this relate to start_workers and scale_up?

jakirkham avatar Jun 25 '18 21:06 jakirkham

start_workers is analogous to stop_workers. We might consider deprecating them. scale_up is currently used as part of the scale/scale_up/scale_down trio.
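A rough sketch of how such a trio can fit together, under stated assumptions: `ToyCluster` is a hypothetical class, not the real distributed Cluster, and it assumes `scale_up` takes a target worker count while `scale_down` takes the specific workers to remove. Neither signature is confirmed in this thread.

```python
class ToyCluster:
    """Hypothetical cluster; illustrates the scale/scale_up/scale_down split."""

    def __init__(self):
        self.workers = []
        self._next_id = 0

    def scale_up(self, n):
        # Assumption: n is the desired total number of workers.
        while len(self.workers) < n:
            self.workers.append(f"worker-{self._next_id}")
            self._next_id += 1

    def scale_down(self, workers):
        # Assumption: the caller names the specific workers to remove.
        to_remove = set(workers)
        self.workers = [w for w in self.workers if w not in to_remove]

    def scale(self, n):
        # Single entry point that picks a direction from the current size.
        if n >= len(self.workers):
            self.scale_up(n)
        else:
            self.scale_down(self.workers[n:])


cluster = ToyCluster()
cluster.scale(3)
after_up = list(cluster.workers)
cluster.scale(1)
after_down = list(cluster.workers)
```

With this split, a downstream library would only need to implement `scale_up` and `scale_down`, and `scale` stays generic.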

mrocklin avatar Jun 25 '18 21:06 mrocklin

@jakirkham do you think more information needs to be added to the docs, or was this question more for your own personal understanding?

GenevieveBuckley avatar Oct 18 '21 06:10 GenevieveBuckley