Allow worker and scheduler to read logs from file
Currently all node subclasses (Scheduler, Nanny and Worker) log to an internal deque which can be queried from a Cluster or Client object via the get_logs or worker_logs methods.
One downside of this is that not everything that would typically end up in stdout/stderr is logged by Dask. Some tracebacks and other startup output are not available.
In many deployment scenarios the output from these processes will be written to a file. This PR adds the ability to configure the scheduler, nanny and worker with the location of that file, so when a client or cluster queries the logs they are read back from the file instead of from the logging deque.
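Conceptually the change is small; the sketch below shows the idea only (names and signatures are illustrative, not the actual diff): the get_logs handler prefers the configured file over the in-memory deque.

from collections import deque
from typing import Optional


def get_logs(log_file: Optional[str] = None,
             log_deque: Optional[deque] = None,
             n: Optional[int] = None) -> list:
    """Return recent log lines, reading the configured file if there is one."""
    if log_file is not None:
        # Read back whatever the launching process redirected into the file,
        # including tracebacks and startup output that never hit the deque
        with open(log_file) as f:
            lines = f.readlines()
    else:
        # Fall back to the existing in-memory logging deque
        lines = list(log_deque or [])
    return lines if n is None else lines[-n:]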
Toy example
dask-scheduler --log-file /tmp/scheduler.log 2>&1 | tee /tmp/scheduler.log
echo "hello world" >> /tmp/scheduler.log
>>> from dask.distributed import Client
>>> client = Client("tcp://localhost:8786", asynchronous=True)
>>> print("".join(list(await client.scheduler.get_logs())))
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://10.51.100.15:8786
distributed.scheduler - INFO - dashboard at: :8787
distributed.scheduler - INFO - Receive client connection: Client-91f8da1e-1ebc-11eb-b114-acde48001122
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:50186', name: tcp://127.0.0.1:50186, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:50186
distributed.core - INFO - Starting established connection
hello world
Real world usage
In dask-cloudprovider we have a new set of VM-based cluster managers. Schedulers and workers are created via cloud-init, a common configuration format for cloud VMs which is often passed as a field called user_data.
For example, on AWS you can specify commands for your EC2 instance to run on launch. This is how the components in dask_cloudprovider.aws.EC2Cluster are run.
cloud-init logs to a file called /var/log/cloud-init-output.log, which contains all the startup information about the VM followed by the stdout/stderr of the custom commands, in this case the Dask component.
With this change we can configure the log_file to be /var/log/cloud-init-output.log, which means that calling cluster.get_logs() will return the full contents of the cloud-init output log instead of just the logging calls made by Dask.
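For illustration, usage from the cluster side might then look like this (a sketch only; it assumes dask-cloudprovider passes the new log file location through to the scheduler and workers it launches via cloud-init):

from dask_cloudprovider.aws import EC2Cluster

# EC2Cluster starts its components via cloud-init, so their stdout/stderr
# already ends up in /var/log/cloud-init-output.log on each VM
cluster = EC2Cluster(n_workers=2)

# With log_file pointed at the cloud-init output log, this returns the full
# VM startup output plus the Dask output, not just the Python logging calls
for name, log in cluster.get_logs().items():
    print(f"--- {name} ---")
    print(log)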
How hard would it be to write a test for this where scheduler/worker output is redirected to a file and we set --log-file just as you do in the toy example above?
Should be possible. Would probably need to inject some extra config into gen_cluster somehow. Suggestions welcome :).
gen_cluster has scheduler_kwargs and worker_kwargs options which may be useful for testing here:
https://github.com/dask/distributed/blob/d7f532caa1564ef09d456d60125c03200fa60fef/distributed/utils_test.py#L835-L836
Thanks @jrbourbeau. That will satisfy one half, where we can set the file to read from. Do you have thoughts on how we can get the logs written there too?
The assumption in this change is that whatever process starts the scheduler/worker will be redirecting its stdout/stderr into a file, so our test would need to do that too.
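Something along these lines might work (an untested sketch: it assumes this PR exposes a log_file keyword on the Scheduler, and the test writes to the file itself to stand in for the stdout/stderr redirection a real deployment would do):

import os
import tempfile

from distributed.utils_test import gen_cluster

LOG_FILE = os.path.join(tempfile.gettempdir(), "dask-test-scheduler.log")


@gen_cluster(client=True, scheduler_kwargs={"log_file": LOG_FILE})  # hypothetical kwarg
async def test_get_logs_from_file(c, s, a, b):
    # gen_cluster runs everything in-process, so simulate the launching
    # process redirecting its output by writing to the file directly
    with open(LOG_FILE, "w") as f:
        f.write("hello world\n")

    logs = await c.get_scheduler_logs()
    assert "hello world" in str(logs)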
Just noting that this would be quite helpful for Pangeo's deployments, where regular users don't have access to stdout on the Kubernetes pods.