Distributed's Adaptive's Scheduler `retire_workers` and Cluster `scale_down`
It appears that Distributed's Adaptive tells the Scheduler to retire_workers and then scale_down the Cluster (https://github.com/dask/distributed/blob/1.21.7/distributed/deploy/adaptive.py#L230-L235). Is there a reason both of these operations are needed? Also, what does this mean for downstream libraries inheriting from Distributed's Adaptive or Distributed's Cluster?
xref: https://github.com/dask/dask-drmaa/issues/65 (for context)
scale_down is used here as a method of last resort to kill jobs after they should already have been closed down smoothly. We don't have to do this though. Retiring alone is probably sufficient.
I had understood the intended semantics to be different: retire_workers is for the Scheduler to tell Worker/Nanny processes to terminate; scale_down is at the cluster level, to tell a machine to shut down. Is this not the intention?
Yes, retire_workers tells the worker and nanny processes to terminate. Typically this results in the cluster-controlled job also being marked as finished at the resource manager level (Kubernetes, YARN, SGE, ...).
scale_down tells the resource manager (Kubernetes, YARN, SGE, ...) to terminate the job.
Nothing is able to tell the machine itself to shut down (I'm not sure if this was your intention). This typically isn't available to anyone except administrators at the physical location.
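To make the ordering concrete, here is a minimal sketch of the two-step shutdown described above. `FakeScheduler`, `FakeCluster`, and `retire_then_scale_down` are hypothetical stand-ins written for illustration, not Distributed's actual classes; only the method names `retire_workers` and `scale_down` come from this thread.

```python
class FakeScheduler:
    """Hypothetical stand-in for the Scheduler: closes worker processes gracefully."""

    def __init__(self, workers, log):
        self.workers = set(workers)
        self.log = log

    def retire_workers(self, workers):
        # Ask the worker/nanny processes to terminate cleanly.
        self.workers -= set(workers)
        self.log.append(("retire_workers", sorted(workers)))


class FakeCluster:
    """Hypothetical stand-in for a Cluster backed by a resource manager."""

    def __init__(self, jobs, log):
        self.jobs = set(jobs)
        self.log = log

    def scale_down(self, workers):
        # Last resort: ask the resource manager to kill any job still around.
        self.jobs -= set(workers)
        self.log.append(("scale_down", sorted(workers)))


def retire_then_scale_down(scheduler, cluster, workers):
    """Graceful shutdown first, then forced cleanup at the resource-manager level."""
    scheduler.retire_workers(workers)
    cluster.scale_down(workers)


log = []
scheduler = FakeScheduler(["w1", "w2", "w3"], log)
cluster = FakeCluster(["w1", "w2", "w3"], log)
retire_then_scale_down(scheduler, cluster, ["w2", "w3"])
print(log)
```

If retiring always succeeds, the second call has nothing left to clean up, which is why retiring alone may be sufficient.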
This might be a silly question, but how do these compare to stop_worker/stop_workers?
stop_worker and stop_workers are older terms. We might consider deprecating them.
SGTM
How does this relate to start_workers and scale_up?
start_workers is analogous to stop_workers. We might consider deprecating both. scale_up is currently used as part of the scale/scale_up/scale_down trio.
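For downstream libraries, a rough sketch of how the trio could fit together is below. `ToyCluster` is hypothetical, and the signatures are assumptions made for illustration (`scale_up` taking a target worker count, `scale_down` taking the workers to remove); check the Cluster class in your Distributed version before relying on them.

```python
class ToyCluster:
    """Hypothetical cluster that tracks workers as plain strings."""

    def __init__(self):
        self.workers = []
        self._next_id = 0

    def scale_up(self, n):
        # Assumption: n is the desired total number of workers.
        while len(self.workers) < n:
            self.workers.append("worker-%d" % self._next_id)
            self._next_id += 1

    def scale_down(self, workers):
        # Assumption: workers is the collection of workers to remove.
        doomed = set(workers)
        self.workers = [w for w in self.workers if w not in doomed]

    def scale(self, n):
        # Dispatch to scale_up or scale_down depending on the target size.
        if n >= len(self.workers):
            self.scale_up(n)
        else:
            self.scale_down(self.workers[n:])


cluster = ToyCluster()
cluster.scale(3)
cluster.scale(1)
print(cluster.workers)
```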
@jakirkham do you think more information needs to be added to the docs, or was this question more for your own personal understanding?