Distributed's Adaptive's Scheduler `retire_workers` and Cluster `scale_down`

It appears that Distributed's Adaptive tells the Scheduler to retire_workers and then scale_down the Cluster. Is there a reason both of these operations are needed? Also what does this mean for downstream libraries inheriting from Distributed's Adaptive or Distributed's Cluster?

xref: https://github.com/dask/dask-drmaa/issues/65 (for context)

May 02 '18 19:05 jakirkham

scale_down is used here as a method of last resort to kill jobs after they should already have been closed down smoothly. We don't have to do this though. Retiring alone is probably sufficient.

On Wed, May 2, 2018 at 3:09 PM, jakirkham wrote:

It appears that Distributed's Adaptive tells the Scheduler to retire_workers and then scale_down the Cluster https://github.com/dask/distributed/blob/1.21.7/distributed/deploy/adaptive.py#L230-L235. Is there a reason both of these operations are needed? Also what does this mean for downstream libraries inheriting from Distributed's Adaptive or Distributed's Cluster?

May 02 '18 20:05 mrocklin
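
To make the ordering described above concrete, here is a minimal sketch of the "retire first, then scale down as a fallback" sequence. It does not use the real distributed classes: `ToyScheduler`, `ToyCluster`, and `retire_then_scale_down` are hypothetical stand-ins that only record the order of calls, and the signatures are simplified assumptions rather than the library's API.

```python
import asyncio


class ToyScheduler:
    """Hypothetical stand-in for the scheduler; not distributed.Scheduler."""

    def __init__(self, log):
        self.log = log

    async def retire_workers(self, workers):
        # Graceful path: ask the worker/nanny processes to shut themselves down.
        self.log.append(("retire_workers", tuple(workers)))
        return list(workers)


class ToyCluster:
    """Hypothetical stand-in for a Cluster implementation."""

    def __init__(self, log):
        self.log = log

    def scale_down(self, workers):
        # Last resort: ask the resource manager to kill whatever is left.
        self.log.append(("scale_down", tuple(workers)))


async def retire_then_scale_down(scheduler, cluster, workers):
    """Retire workers gracefully, then call scale_down as a fallback."""
    if not workers:
        return []
    retired = await scheduler.retire_workers(workers)
    cluster.scale_down(retired)
    return retired


log = []
result = asyncio.run(
    retire_then_scale_down(ToyScheduler(log), ToyCluster(log), ["w1", "w2"])
)
```

After this runs, `log` holds the `retire_workers` entry followed by the `scale_down` entry for the same workers. If retiring alone turns out to be sufficient, the `cluster.scale_down` call is the step that could be dropped.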

I had understood the intended semantics to be different: retire_workers is for the Scheduler to tell Worker/Nanny processes to terminate; scale_down is at the cluster level, to tell a machine to shut down. Is this not the intention?

May 04 '18 10:05 rbubley

Yes, retire_workers tells the worker and nanny processes to terminate. Typically this results in the cluster-controlled job also being marked as finished at the resource manager level (Kubernetes, YARN, SGE, ...).

scale_down tells the resource manager (Kubernetes, YARN, SGE, ...) to terminate the job.

Nothing is able to tell the machine itself to shut down (I'm not sure if this was your intention). This typically isn't available to anyone except administrators at the physical location.

May 04 '18 10:05 mrocklin
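
For downstream libraries, one possible reading of the distinction above is that `scale_down` may be called for workers whose jobs have already finished on their own after `retire_workers`. The sketch below illustrates a `scale_down` that tolerates this. It is an assumption about how a subclass might behave, not something the thread or the library prescribes; `ToyResourceManager` and `ToyJobCluster` are hypothetical names with no connection to the real Kubernetes, YARN, or SGE clients.

```python
class ToyResourceManager:
    """Hypothetical stand-in for a resource manager; tracks job states only."""

    def __init__(self, jobs):
        self.state = {job: "running" for job in jobs}

    def mark_finished(self, job):
        # Models a job that exited cleanly after its worker was retired.
        self.state[job] = "finished"

    def terminate(self, job):
        # Models a forced kill issued through the resource manager.
        self.state[job] = "terminated"


class ToyJobCluster:
    """Hypothetical cluster that maps worker names to resource manager jobs."""

    def __init__(self, manager, worker_to_job):
        self.manager = manager
        self.worker_to_job = dict(worker_to_job)

    def scale_down(self, workers):
        """Terminate only the jobs that are still running; return those jobs."""
        killed = []
        for worker in workers:
            job = self.worker_to_job.pop(worker, None)
            if job is not None and self.manager.state.get(job) == "running":
                self.manager.terminate(job)
                killed.append(job)
        return killed


manager = ToyResourceManager(["job-1", "job-2"])
cluster = ToyJobCluster(manager, {"w1": "job-1", "w2": "job-2"})

# job-1 finished by itself after its worker was retired; job-2 is still running.
manager.mark_finished("job-1")
killed = cluster.scale_down(["w1", "w2"])
```

Here only `job-2` is terminated, and calling `scale_down` a second time with the same workers does nothing, which keeps the fallback safe to call unconditionally.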

This might be a silly question, but how do these compare to stop_worker/stop_workers?

May 04 '18 14:05 jakirkham

stop_worker and stop_workers are older terms. We might consider deprecating them.

May 04 '18 15:05 mrocklin

SGTM

May 04 '18 15:05 jakirkham

How does this relate to start_workers and scale_up?

Jun 25 '18 21:06 jakirkham

start_workers is analogous to stop_workers. We might consider deprecating them. scale_up is currently used as part of the scale/scale_up/scale_down trio.

Jun 25 '18 21:06 mrocklin

@jakirkham do you think more information needs to be added to the docs, or was this question more for your own personal understanding?

Oct 18 '21 06:10 GenevieveBuckley
