Allow worker and scheduler to read logs from file
Currently all node subclasses (Scheduler, Nanny and Worker) log to an internal deque which can be queried from a Cluster or Client object via the get_logs or worker_logs methods.
One downside of this is that not everything that would typically end up in stdout/stderr is logged by Dask. Some tracebacks and other startup output are not available.
In many deployment scenarios the output from these processes will be written to a file. This PR adds the ability to configure the scheduler, nanny and worker with the location of that file, so when a client or cluster queries the logs they are read back from the file instead of from the logging deque.
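Conceptually the change is small; the sketch below shows the idea only (names and signatures are illustrative, not the actual diff): the get_logs handler prefers the configured file over the in-memory deque.

from collections import deque
from typing import Optional


def get_logs(log_file: Optional[str] = None,
             log_deque: Optional[deque] = None,
             n: Optional[int] = None) -> list:
    """Return recent log lines, reading the configured file if there is one."""
    if log_file is not None:
        # Read back whatever the launching process redirected into the file,
        # including tracebacks and startup output that never hit the deque
        with open(log_file) as f:
            lines = f.readlines()
    else:
        # Fall back to the existing in-memory logging deque
        lines = list(log_deque or [])
    return lines if n is None else lines[-n:]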
Toy example
dask-scheduler --log-file /tmp/scheduler.log 2>&1 | tee /tmp/scheduler.log
echo "hello world" >> /tmp/scheduler.log
>>> from dask.distributed import Client
>>> client = Client("tcp://localhost:8786", asynchronous=True)
>>> print("".join(list(await client.scheduler.get_logs())))
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://10.51.100.15:8786
distributed.scheduler - INFO - dashboard at: :8787
distributed.scheduler - INFO - Receive client connection: Client-91f8da1e-1ebc-11eb-b114-acde48001122
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:50186', name: tcp://127.0.0.1:50186, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:50186
distributed.core - INFO - Starting established connection
hello world
Real world usage
In dask-cloudprovider we have a new set of VM-based cluster managers. Schedulers and workers are created via cloud-init, a common configuration format for cloud VMs which is often passed as a field called user_data.
For example, on AWS you can specify commands for your EC2 instance to run on launch. This is how the components in dask_cloudprovider.aws.EC2Cluster are run.
cloud-init logs to a file called /var/log/cloud-init-output.log, which contains all the startup information about the VM followed by the stdout/stderr of the custom commands, in this case the Dask component.
With this change we can configure the log_file to be /var/log/cloud-init-output.log, which means that calling cluster.get_logs() will return the full contents of the cloud-init output log instead of just the logging calls made by Dask.
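For illustration, usage from the cluster side might then look like this (a sketch only; it assumes dask-cloudprovider passes the new log file location through to the scheduler and workers it launches via cloud-init):

from dask_cloudprovider.aws import EC2Cluster

# EC2Cluster starts its components via cloud-init, so their stdout/stderr
# already ends up in /var/log/cloud-init-output.log on each VM
cluster = EC2Cluster(n_workers=2)

# With log_file pointed at the cloud-init output log, this returns the full
# VM startup output plus the Dask output, not just the Python logging calls
for name, log in cluster.get_logs().items():
    print(f"--- {name} ---")
    print(log)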
How hard would it be to write a test for this where scheduler/worker output is redirected to a file and we set --log-file just as you do in the toy example above?
Should be possible. Would probably need to inject some extra config into gen_cluster somehow. Suggestions welcome :).
gen_cluster has scheduler_kwargs and worker_kwargs options which may be useful for testing here:
https://github.com/dask/distributed/blob/d7f532caa1564ef09d456d60125c03200fa60fef/distributed/utils_test.py#L835-L836
Thanks @jrbourbeau. That will satisfy one half, where we can set the file to read from. Do you have thoughts on how we can get the logs written there too?
The assumption in this change is that whatever process starts the scheduler/worker will be redirecting its stdout/stderr into a file, so our test would need to do that too.
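Something along these lines might work (an untested sketch: it assumes this PR exposes a log_file keyword on the Scheduler, and the test writes to the file itself to stand in for the stdout/stderr redirection a real deployment would do):

import os
import tempfile

from distributed.utils_test import gen_cluster

LOG_FILE = os.path.join(tempfile.gettempdir(), "dask-test-scheduler.log")


@gen_cluster(client=True, scheduler_kwargs={"log_file": LOG_FILE})  # hypothetical kwarg
async def test_get_logs_from_file(c, s, a, b):
    # gen_cluster runs everything in-process, so simulate the launching
    # process redirecting its output by writing to the file directly
    with open(LOG_FILE, "w") as f:
        f.write("hello world\n")

    logs = await c.get_scheduler_logs()
    assert "hello world" in str(logs)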
Just noting that this would be quite helpful for Pangeo's deployments, where regular users don't have access to stdout on the Kubernetes pods.