starting dask.distributed.Client with default settings results in endless restarting of workers
What happened:
- Starting Client locally with default settings results in endless restarting of workers (with processes=True).
- Note: this started after updating my conda environment from Dask 2021.3.0 and Python 3.8.8 (it was fine with those versions before).
What you expected to happen: for the client to start
Minimal Complete Verifiable Example:
This fails:
from dask.distributed import Client
client = Client(processes=True)
This starts:
from dask.distributed import Client
client = Client(processes=False)
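Side note for anyone reproducing this from a plain script rather than a notebook: on Windows, multiprocessing spawns worker processes that re-import the main module, so the Dask documentation recommends guarding cluster creation behind a main check; a missing guard can produce a very similar endless-restart loop. A minimal sketch of that pattern (it does not apply inside Jupyter, but is worth ruling out when testing from a script):

from dask.distributed import Client

if __name__ == "__main__":
    # Without this guard, each spawned worker process re-executes the
    # module top level, tries to start its own Client, and the nannies
    # keep restarting workers.
    client = Client(processes=True)
    print(client)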
Anything else we need to know?:
Running from a Jupyter notebook
Environment:
- Dask version: 2022.9.0
- Distributed version: 2022.9.0
- Python version: 3.9.13
- Operating System: Windows 10
- Install method: conda-forge
Error output on failing example:
2022-09-14 09:17:43,926 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,939 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,949 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,962 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,969 - distributed.nanny - WARNING - Restarting worker
Traceback (most recent call last):
File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\nanny.py", line 822, in _wait_until_connected
msg = self.init_result_q.get_nowait()
File "c:\Users\myuser\Miniconda3\envs\sim\lib\multiprocessing\queues.py", line 135, in get_nowait
return self.get(False)
File "c:\Users\myuser\Miniconda3\envs\sim\lib\multiprocessing\queues.py", line 116, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\utils.py", line 799, in wrapper
return await func(*args, **kwargs)
File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\nanny.py", line 539, in _on_worker_exit
await self.instantiate()
File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\nanny.py", line 438, in instantiate
result = await self.process.start()
File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\nanny.py", line 695, in start
msg = await self._wait_until_connected(uid)
File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\nanny.py", line 824, in _wait_until_connected
await asyncio.sleep(self._init_msg_interval)
File "c:\Users\myuser\Miniconda3\envs\sim\lib\asyncio\tasks.py", line 652, in sleep
return await future
asyncio.exceptions.CancelledError
...
A little update: I progressively rolled back Dask versions from 2022.9.0 to 2021.3.0, testing each one, and saw the same issue. In theory that shouldn't happen, since 2021.3.0 was fine before.
So, next, to rule out any environment oddities, I reverted to my old conda environment (via a YAML backup) with Python 3.8.8 and Dask 2021.3.0, and I am still seeing the same restarting-worker behavior... This time I also ran it with the plain interpreter (not Jupyter) and it made no difference. A bit perplexing...
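One thing that might surface the underlying exception behind the restarts: build the cluster explicitly and lower the log-silencing threshold. A sketch, using LocalCluster's silence_logs parameter (it takes a logging level, logging.WARN by default; the level chosen here is illustrative):

import logging
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Show nanny/worker logs below WARNING, so the real error that kills
    # each worker is printed instead of just "Restarting worker".
    cluster = LocalCluster(processes=True, silence_logs=logging.DEBUG)
    client = Client(cluster)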
This is happening for me too, but only in Zeppelin (Jupyter is working fine). Specifying processes=True (which is the default) for LocalCluster or Client makes the paragraph hang forever in Zeppelin (see the examples below):
%python
from dask.distributed import Client
client = Client()
client
or
%python
from dask.distributed import LocalCluster
cluster = LocalCluster()
cluster
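For what it's worth, the threads-only variant from the original report presumably works as a stopgap in Zeppelin as well (untested here); a sketch:

%python
from dask.distributed import Client
# Workaround from the original report: keep everything in one process
# (threads only), so no subprocess spawning is involved.
client = Client(processes=False)
client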
From the Dask documentation, the client:
will check your local Dask config and environment variables to see if connection information has been specified. If not it will create an instance of LocalCluster and use that.
So the problem is rather in LocalCluster with the processes param set to True, for some reason.
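To make that default explicit: the hanging Client() call above should be roughly equivalent to constructing the cluster by hand, per the docs quoted above. A sketch:

%python
from dask.distributed import Client, LocalCluster
# What Client() does by default: create a LocalCluster (with
# processes=True) and connect to it. If this hangs too, the fault is
# in LocalCluster's process spawning rather than in Client itself.
cluster = LocalCluster(processes=True)
client = Client(cluster)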
Could you please advise why this is only happening in Zeppelin? We are really stuck...
Thank you in advance.
After combing through old issues, I believe my issue is a duplicate of this one (https://github.com/dask/distributed/issues/5574). It's kind of amazing this issue still exists after all these years; it seems to be a very low priority for Microsoft... This is the corresponding issue in VSCode's repo (https://github.com/microsoft/vscode-jupyter/issues/2962).
My problem went away once I stopped using the VSCode Python Interactive Window (I switched back to Atom and its Hydrogen plugin). Of course, running the script from the shell works too.