
Worker process dying unexpectedly should not result in a zero exit code

Open · shughes-uk opened this issue 2 years ago · 3 comments

I'm seeing this in the logs

distributed.nanny - INFO - Closing Nanny gracefully at 'tls://xx'. Reason: worker-handle-scheduler-connection-broken

but the nanny process is returning a 0 exit code. This very much feels like it should return a non-zero exit code. Otherwise we can't really differentiate between dask workers that are exiting for sad reasons, and those that are just responding to a SIGTERM or a scheduler scale-down request.

Best case, the nanny exits with a set of documented exit codes, one per bad exit condition. Next best case, pick a single exit code and use it for every 'bad' exit.
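
To make the ask concrete, here's a minimal sketch of the kind of supervision this would enable, assuming the `dask worker` CLI entry point and assuming (hypothetically) that the nanny exited non-zero on unexpected worker death, which is not the current behaviour. The scheduler address is a placeholder.

```python
import subprocess
import sys

# Hypothetical wrapper: launch a worker and react to its exit status.
proc = subprocess.run(["dask", "worker", "tls://scheduler-address:8786"])

if proc.returncode == 0:
    # Today this branch is taken even when the worker died for a bad reason.
    print("worker shut down cleanly (SIGTERM / scale-down)")
else:
    print(f"worker exited badly with code {proc.returncode}", file=sys.stderr)
    sys.exit(proc.returncode)
```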

shughes-uk · Jul 06 '23 02:07

Why should a graceful close return a non-zero exit code? (FWIW I think the log message you are posting is mixing up things between Nanny and Worker).

What kind of reasons would be useful to you? Most of the useful information about what is going on with the Worker is not available to the Nanny. These are two different processes, and we currently don't have a solid infrastructure in place to bubble application logic up to the nanny process. If something interesting happens, there are typically logs for it.
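
For illustration only (this is the general mechanism, not distributed's internals): a parent process mostly just sees its child's exit status, so anything richer than a numeric code has to be shipped across the process boundary explicitly.

```python
import multiprocessing
import sys

def worker_main() -> None:
    # Stand-in for a worker process hitting a fatal condition and exiting non-zero.
    sys.exit(70)

if __name__ == "__main__":
    child = multiprocessing.Process(target=worker_main)
    child.start()
    child.join()
    # The parent only observes the exit status; any richer context about *why*
    # the child died needs explicit IPC (queues, pipes, shared logs, ...).
    print("child exitcode:", child.exitcode)
```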

fjetter · Jul 18 '23 10:07

@jrbourbeau helped me go spelunking in the distributed code a bit.

Here's an example where the worker calls close in a way that shuts the nanny down gracefully, but is definitely not good.

https://github.com/dask/distributed/blob/b7e5f8f97ef0eb95368122f29ac8915ae692f94e/distributed/worker.py#L1295-L1297
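
To illustrate the pattern being pointed at with a self-contained toy (ToyWorker and the handler name are made up here; this is not distributed's actual code): the broken scheduler connection is handled as an ordinary, "graceful" close, so the process can still end up exiting with status 0.

```python
import asyncio
import sys

class ToyWorker:
    async def close(self, reason: str) -> None:
        print(f"closing worker, reason={reason!r}")
        # Nothing here records that the shutdown was abnormal.

async def handle_scheduler_connection_broken(worker: ToyWorker) -> None:
    print("Connection to scheduler broken. Closing without reporting.")
    await worker.close(reason="worker-handle-scheduler-connection-broken")

if __name__ == "__main__":
    asyncio.run(handle_scheduler_connection_broken(ToyWorker()))
    sys.exit(0)  # the process status stays 0 despite the unplanned shutdown
```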

In another situation, where the scheduler event loop is blocked for a long time, you get the following logs on the worker:

distributed.preloading - INFO - Run preload teardown: https://cloud.coiled.io/api/v2/cluster_facing/preload/worker
distributed.nanny - INFO - Closing Nanny at 'tls://10.1.93.142:39799'.
distributed.nanny - ERROR - Worker process died unexpectedly
distributed.nanny - INFO - Worker closed
distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-e31a90a0-51c1-4207-a4db-19e0c3d50fb3 Address tls://10.1.93.142:46879 Status: Status.closing
distributed.worker - INFO - Stopping worker at tls://10.1.93.142:46879
distributed.worker - INFO - Timed out while trying to connect during heartbeat

Which seems to come from here

https://github.com/dask/distributed/blob/b7e5f8f97ef0eb95368122f29ac8915ae692f94e/distributed/nanny.py#L913-L921

This should absolutely result in a non-zero exit code on the nanny.
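
As a purely hypothetical sketch of what that could look like (not an existing distributed API, and only the second reason string below appears in the logs above; the first is invented for illustration), the nanny's shutdown path could map close reasons onto documented exit statuses:

```python
import sys

EXIT_OK = 0                       # graceful close (SIGTERM, scheduler scale-down)
EXIT_WORKER_DIED = 70             # worker process died unexpectedly
EXIT_SCHEDULER_UNREACHABLE = 71   # scheduler connection broken / heartbeat timeout

BAD_REASONS = {
    "worker-process-died-unexpectedly": EXIT_WORKER_DIED,
    "worker-handle-scheduler-connection-broken": EXIT_SCHEDULER_UNREACHABLE,
}

def exit_code_for(reason: str) -> int:
    """Translate a close reason into a process exit status."""
    return BAD_REASONS.get(reason, EXIT_OK)

if __name__ == "__main__":
    # e.g. the very end of a hypothetical nanny shutdown path:
    sys.exit(exit_code_for("worker-handle-scheduler-connection-broken"))
```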

These things all show up in logs, which is fine if a human is reading them. Exit codes would allow robust programmatic monitoring for worker issues.

For example, with Coiled this would let me quickly query for "bad" worker deaths across all clusters. Currently only catastrophic errors that kill the nanny can be found easily; everything else requires painstaking review of individual cluster logs. You can kind of see it by looking for clusters where the worker count decreases, but a lot of the time it's hard to tell that apart from legitimate scaling activity.

shughes-uk · Jul 18 '23 18:07

@florian-jetter-by Some data to share with you here https://github.com/coiled/dask-engineering/issues/186

shughes-uk · Jul 18 '23 22:07