[Question] Dask on Cluster
Description
I want to parallelize tuning on a SLURM cluster using Dask; specifically, I am using the Hydra sweeper. The problem is that the workers don't connect to the Dask scheduler and simply time out.
Steps/Code to Reproduce
I got the error both with my own code and with the multifidelity MLP example here: https://github.com/automl/hydra-smac-sweeper/blob/main/examples/multifidelity_mlp.py
Expected Results
Ideally the sweeper would distribute each function evaluation to a new SLURM job, which would in turn report its result back to the cluster.
Actual Results
I get this error:
[WARNING][dask_runner.py:135] No workers are available. This could mean workers crashed. Waiting for new workers...
Traceback (most recent call last):
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/site-packages/hydra/_internal/utils.py", line 219, in run_and_report
return func()
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/site-packages/hydra/_internal/utils.py", line 466, in
Dask suggests printing the job script and running it by itself. The script looks correct to me, but running it gives this error:
Traceback (most recent call last):
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/site-packages/distributed/comm/core.py", line 289, in connect
comm = await asyncio.wait_for(
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/asyncio/tasks.py", line 479, in wait_for
return fut.result()
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/site-packages/distributed/comm/tcp.py", line 444, in connect
convert_stream_closed_error(self, e)
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/site-packages/distributed/comm/tcp.py", line 133, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x2ac245b74ca0>: ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/site-packages/distributed/core.py", line 287, in _
await asyncio.wait_for(self.start(), timeout=timeout)
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/asyncio/tasks.py", line 479, in wait_for
return fut.result()
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/site-packages/distributed/nanny.py", line 329, in start
msg = await self.scheduler.register_nanny()
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/site-packages/distributed/core.py", line 919, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/site-packages/distributed/core.py", line 1089, in connect
comm = await fut
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/site-packages/distributed/comm/core.py", line 315, in connect
raise OSError(
OSError: Timed out trying to connect to tcp://130.75.7.144:35161 after 30 s
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/bigwork/nhwpeimt/miniconda3/envs/autorl/lib/python3.9/site-packages/distributed/cli/dask_worker.py", line 495, in
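The `ConnectionRefusedError` followed by the connect timeout suggests the worker node cannot open a TCP connection back to the scheduler at all. As a quick check for a firewall between nodes (a hypothetical helper, not part of Dask or the sweeper), one can attempt a raw TCP connection from a compute node to the scheduler address shown in the traceback:

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a plain TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Substitute the scheduler host/port from the traceback.
    print(can_reach("130.75.7.144", 35161))
```

If this returns False from the compute node while the scheduler is running, the problem is network reachability rather than anything Dask-specific.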
Since this cluster blocks internet access, I can imagine it also restricts connections between nodes in some way. I didn't find much information on this in the Dask documentation and couldn't solve it so far. Maybe this is something you know about and would want to include in the documentation, though?
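If the nodes can only reach each other over an internal interconnect, one possible fix is to pin Dask's communication to that interface. A sketch via Dask's jobqueue configuration file (the interface name `ib0` is an assumption; check `ip addr` on the nodes for the actual internal NIC):

```yaml
# ~/.config/dask/jobqueue.yaml -- sketch only, "ib0" is an assumed
# InfiniBand interface name; substitute your cluster's internal NIC.
jobqueue:
  slurm:
    interface: ib0          # workers bind and connect over this NIC
    scheduler-options:
      interface: ib0        # scheduler listens on the internal network
```

The same settings can be passed programmatically to `dask_jobqueue.SLURMCluster(interface=..., scheduler_options=...)` if the sweeper exposes them.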
Versions
'2.0.1'
Related to #1016. Possibly fixed by #1032. Will investigate 🌻