Worker process dying unexpectedly should not result in a zero exit code
I'm seeing this in the logs:
distributed.nanny - INFO - Closing Nanny gracefully at 'tls://xx'. Reason: worker-handle-scheduler-connection-broken
but the nanny process is returning a 0 exit code. This very much feels like it should return a non-zero exit code. Otherwise we can't really differentiate between dask workers that are exiting for sad reasons and those that are just responding to a SIGTERM or a scheduler scale-down request.
Best case, the nanny would exit with one of a set of documented exit codes, depending on the bad exit condition (a rough sketch of what that could look like is below). Next best case, pick a single exit code and use it for every 'bad' exit.
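Something like the following, purely as an illustration; none of these constants, reason strings (other than the one in the log above), or the helper function exist in distributed today:

```python
# Hypothetical sketch of "a variety of documented exit codes".
EXIT_OK = 0                     # requested shutdown: SIGTERM, scheduler scale-down
EXIT_SCHEDULER_CONNECTION = 70  # e.g. "worker-handle-scheduler-connection-broken"
EXIT_WORKER_DIED = 71           # worker process died unexpectedly
EXIT_OTHER_FAILURE = 72         # catch-all for any other "bad" exit


def exit_code_for_reason(reason: str, requested: bool) -> int:
    """Map a close reason to a documented exit code (illustrative only)."""
    if requested:
        return EXIT_OK
    if "scheduler-connection-broken" in reason:
        return EXIT_SCHEDULER_CONNECTION
    if "process-died" in reason:
        return EXIT_WORKER_DIED
    return EXIT_OTHER_FAILURE
```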
Why should a graceful close return a non-zero exit code? (FWIW I think the log message you are posting is mixing up things between Nanny and Worker).
What kind of reasons would be useful to you? Most of the useful information about what is going on with the Worker is not available to the Nanny. These are two different processes, and we currently don't have solid infrastructure in place to bubble application logic up to the nanny process. If something interesting is happening, there are typically logs for it.
@jrbourbeau helped me go spelunking in the distributed code a bit.
Here's an example where the worker is calling close in a way that shuts down the nanny gracefully, but is definitely not good.
https://github.com/dask/distributed/blob/b7e5f8f97ef0eb95368122f29ac8915ae692f94e/distributed/worker.py#L1295-L1297
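To paraphrase the pattern at those lines (this is a simplified sketch, not the actual distributed source, and the method/argument names are approximate): a broken scheduler connection funnels into the same close path as a requested shutdown, so the nanny only ever sees a graceful close and exits 0.

```python
class SketchWorker:
    """Simplified sketch of the pattern at the linked worker.py lines.

    Not the real Worker class: it only shows how a broken scheduler
    connection ends up on the same close() path as a deliberate shutdown,
    so the nanny never learns this was a failure.
    """

    async def handle_scheduler(self, comm):
        try:
            await self.stream_from_scheduler(comm)  # runs until the comm breaks
        finally:
            # Connection broken: close without reporting back. Nothing here
            # distinguishes this from a requested, healthy shutdown.
            await self.close(report=False, nanny=True)

    async def stream_from_scheduler(self, comm):
        ...  # placeholder for the message loop

    async def close(self, report=True, nanny=True):
        ...  # placeholder: also asks the nanny to close "gracefully"
```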
In another situation, where the scheduler event loop is blocked for a long time, you get the following logs on the worker:
distributed.preloading - INFO - Run preload teardown: https://cloud.coiled.io/api/v2/cluster_facing/preload/worker
distributed.nanny - INFO - Closing Nanny at 'tls://10.1.93.142:39799'.
distributed.nanny - ERROR - Worker process died unexpectedly
distributed.nanny - INFO - Worker closed
distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-e31a90a0-51c1-4207-a4db-19e0c3d50fb3 Address tls://10.1.93.142:46879 Status: Status.closing
distributed.worker - INFO - Stopping worker at tls://10.1.93.142:46879
distributed.worker - INFO - Timed out while trying to connect during heartbeat
Which seems to come from here
https://github.com/dask/distributed/blob/b7e5f8f97ef0eb95368122f29ac8915ae692f94e/distributed/nanny.py#L913-L921
This should absolutely result in a nonzero exit code on the nanny.
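A sketch of the bookkeeping that would take, purely hypothetical (this is not distributed's Nanny class, and the 71 value is the made-up constant from the earlier sketch):

```python
import logging
import sys

logger = logging.getLogger("distributed.nanny")

EXIT_WORKER_DIED = 71  # hypothetical documented exit code


class NannyExitTracker:
    """Hypothetical helper: remember whether the worker's death was requested,
    and turn an unexpected death into a non-zero nanny exit code."""

    def __init__(self):
        self.stop_requested = False   # set when SIGTERM / scheduler asks us to stop
        self.unexpected_death = False

    def on_worker_exit(self, exitcode):
        # Called when the watched worker process terminates.
        if not self.stop_requested and exitcode not in (0, None):
            logger.error("Worker process died unexpectedly")
            self.unexpected_death = True

    def exit(self):
        # Called once the nanny itself is shutting down.
        sys.exit(EXIT_WORKER_DIED if self.unexpected_death else 0)
```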
These things all show up in logs, which is fine if a human is reading them. Exit codes would allow robust programmatic monitoring for worker issues.
For example, with Coiled this would let me quickly query for "bad" worker deaths across all clusters. Currently only catastrophic errors that kill the nanny can be found without painstakingly reviewing individual cluster logs. You can sort of see it by looking for clusters where the worker count decreases, but much of the time it's hard to tell that apart from legitimate scaling activity.
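To make the monitoring angle concrete, here's roughly what it would enable on the platform side, assuming the nanny did propagate distinct codes (the 70/71 values are the hypothetical ones from the sketch above, not anything dask-worker does today, and the scheduler address is a placeholder):

```python
import subprocess

# Hypothetical exit codes from the earlier sketch; today a "graceful"
# close returns 0 regardless of the cause.
BAD_EXITS = {
    70: "scheduler connection broken",
    71: "worker process died unexpectedly",
}

proc = subprocess.run(["dask-worker", "tls://scheduler-address:8786"])

if proc.returncode in BAD_EXITS:
    # A platform could record or alert on this without scraping cluster logs.
    print(f"worker exited badly: {BAD_EXITS[proc.returncode]}")
elif proc.returncode != 0:
    print(f"worker exited with unexpected code {proc.returncode}")
```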
@florian-jetter-by Some data to share with you here https://github.com/coiled/dask-engineering/issues/186