dask-cloudprovider
ECSCluster does not de-provision tasks after failing to connect to scheduler
What happened:
If ECSCluster successfully runs a scheduler task but then fails to connect to the scheduler, an error is raised and the scheduler task is left active. The task has to be manually removed from the ECS cluster and deregistered.
In my case, this happened when I forgot to connect to a VPN network that would provide connectivity to the VPC. The task successfully ran, but no network connection to the scheduler was possible.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 286, in connect
comm = await asyncio.wait_for(
File "/usr/local/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 729, in __init__
super().__init__(**kwargs)
File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 281, in __init__
self.sync(self._start)
File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 189, in sync
return sync(self.loop, func, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync
raise exc.with_traceback(tb)
File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 324, in f
result[0] = yield future
File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 926, in _start
await super()._start()
File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 314, in _start
await super()._start()
File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 73, in _start
comm = await self.scheduler_comm.live_comm()
File "/usr/local/lib/python3.8/site-packages/distributed/core.py", line 746, in live_comm
comm = await connect(
File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
raise IOError(
OSError: Timed out trying to connect to tcp://10.53.13.110:8786 after 10 s
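For reference, the manual cleanup mentioned above amounts to something like this; a minimal boto3 sketch where the cluster name, task ARN, and task definition revision are placeholders:
import boto3

ecs = boto3.client("ecs")

# Stop the orphaned scheduler task that ECSCluster left running.
ecs.stop_task(
    cluster="dask-xxxxxxxx",
    task="arn:aws:ecs:<region>:<account>:task/dask-xxxxxxxx/<task-id>",
    reason="Cleaning up orphaned Dask scheduler",
)

# Deregister the scheduler task definition revision it created.
ecs.deregister_task_definition(taskDefinition="dask-xxxxxxxx-scheduler:1")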
What you expected to happen:
If ECSCluster fails to connect to the running scheduler, it should catch the error and clean up the scheduler task.
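Roughly, I would expect something along these lines (just a sketch of the idea, not the actual ECSCluster code):
# Sketch only: if connecting to the freshly started scheduler fails,
# tear down whatever was provisioned before re-raising.
async def _start(self):
    try:
        await super()._start()   # starts the scheduler task, then connects to it
    except OSError:
        await self._close()      # e.g. stop the scheduler task and deregister its definition
        raise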
Minimal Complete Verifiable Example:
Create an ECSCluster instance with tasks in an unreachable network.
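For example, something like the following, where the VPC is only reachable over a VPN that is not connected (all IDs are placeholders):
from dask_cloudprovider.aws import ECSCluster

# The scheduler task starts successfully, but the client cannot reach its
# address, so __init__ raises OSError and the task is left running.
cluster = ECSCluster(
    fargate_scheduler=True,
    fargate_workers=True,
    vpc="vpc-xxxxxxxx",
    subnets=["subnet-xxxxxxxx"],
    security_groups=["sg-xxxxxxxx"],
    n_workers=1,
)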
Anything else we need to know?:
None.
Environment:
- Dask version: 2021.2.0
- Python version: 3.8.6
- Operating System: Debian (python:3.8 Docker image)
- Install method (conda, pip, source): pip
Same problem with AzureVMCluster:
>>> cluster = AzureVMCluster(
... resource_group=resource_group,
... location = location,
... vnet=vnet,
... security_group=security_group,
... n_workers=initial_node_count,
... vm_size=vm_size,
... docker_image=base_dockerfile,
... docker_args="--privileged",
... auto_shutdown=False,
... security=False,
... env_vars={"EXTRA_PIP_PACKAGES": EXTRA_PIP_PACKAGES},
... worker_class="dask_cuda.CUDAWorker")
Creating scheduler instance
Assigned public IP
Network interface ready
Creating VM
Created VM dask-4828a19b-scheduler
Waiting for scheduler to run at 13.66.2.14:8786
Scheduler is running
Traceback (most recent call last):
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\comm\core.py", line 319, in connect
handshake = await asyncio.wait_for(comm.read(), time_left())
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\asyncio\tasks.py", line 494, in wait_for
return fut.result()
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\comm\tcp.py", line 215, in read
frames = unpack_frames(frames)
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\protocol\utils.py", line 70, in unpack_frames
(n_frames,) = struct.unpack_from(fmt, b)
struct.error: unpack_from requires a buffer of at least 8 bytes for unpacking 8 bytes at offset 0 (actual buffer size is 2)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\dask_cloudprovider\azure\azurevm.py", line 496, in __init__
super().__init__(**kwargs)
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\dask_cloudprovider\generic\vmcluster.py", line 284, in __init__
super().__init__(**kwargs, security=self.security)
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\deploy\spec.py", line 281, in __init__
self.sync(self._start)
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\deploy\cluster.py", line 189, in sync
return sync(self.loop, func, *args, **kwargs)
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\utils.py", line 351, in sync
raise exc.with_traceback(tb)
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\utils.py", line 334, in f
result[0] = yield future
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\tornado\gen.py", line 762, in run
value = future.result()
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\dask_cloudprovider\generic\vmcluster.py", line 324, in _start
await super()._start()
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\deploy\spec.py", line 314, in _start
await super()._start()
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\deploy\cluster.py", line 73, in _start
comm = await self.scheduler_comm.live_comm()
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\core.py", line 746, in live_comm
comm = await connect(
File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\comm\core.py", line 324, in connect
raise IOError(
OSError: Timed out during handshake while connecting to tcp://13.66.2.14:8786 after 10 s
It happened with both dask-cloudprovider[azure]==2021.1.1 and dask-cloudprovider[azure]==2021.3.0
As this seems to be a general problem with cluster managers based on SpecCluster, I've raised an upstream issue: dask/distributed#4589.
@manuelreyesgomez I've seen struct.error: unpack_from requires a buffer of at least 8 bytes a lot recently; it is usually caused by mismatched Dask versions.
Please check that your local dask/distributed versions match the ones in the container you are using.
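For example, you can compare the local versions against the ones in the image; a quick sketch (the docker invocation assumes python is on the image's default PATH):
import subprocess
import dask, distributed

# Versions installed in the local environment.
print("local:    ", dask.__version__, distributed.__version__)

# Versions baked into the Docker image used for the scheduler/workers.
image = "rapidsai/rapidsai-core-nightly:0.18-cuda10.2-runtime-ubuntu18.04-py3.8"
check = "import dask, distributed; print(dask.__version__, distributed.__version__)"
print("container:", subprocess.check_output(
    ["docker", "run", "--rm", image, "python", "-c", check], text=True
).strip())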
@jacobtomlinson By container you mean the docker_image used in the constructor?
From a new conda environment I pip install the latest version of dask-cloudprovider[azure] and then run:
resource_group = "NGC-AML-Quick-Launch"
workspace_name = "NGC_AML_Quick_Launch_WS"
vnet = "NGC-AML-Quick-Launch-vnet"
security_group = "NGC-AML-Quick-Launch-nsg"
initial_node_count = 2
vm_size = "Standard_NC12s_v3"
location = "South Central US"
base_dockerfile = "rapidsai/rapidsai-core-nightly:0.18-cuda10.2-runtime-ubuntu18.04-py3.8"
EXTRA_PIP_PACKAGES = "dask-cloudprovider[azure]"
cluster = AzureVMCluster(
    resource_group=resource_group,
    location=location,
    vnet=vnet,
    security_group=security_group,
    n_workers=initial_node_count,
    vm_size=vm_size,
    docker_image=base_dockerfile,
    docker_args="--privileged",
    auto_shutdown=False,
    security=False,
    env_vars={"EXTRA_PIP_PACKAGES": EXTRA_PIP_PACKAGES},
    worker_class="dask_cuda.CUDAWorker",
)
I pass the "dask-cloudprovider[azure]" as an extra pip install with the hopes of syncing dask and distributed version between the local machine and the scheduler/workers
Do not understand where the mismatch might be
By container you mean the docker_image used in the constructor?
Yes.
I pass the "dask-cloudprovider[azure]" as an extra pip install with the hopes of syncing dask and distributed version between the local machine and the scheduler/workers
This is not enough on its own. It will install dask-cloudprovider, but since dask and distributed are already installed in the image it will just keep those versions.
You could either check the package versions in the container and install them locally, or you could set EXTRA_PIP_PACKAGES to something like --upgrade dask distributed dask-cloudprovider[azure] to force an upgrade to the latest versions in the Docker image at runtime.
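For example, reusing the arguments from your constructor call above:
cluster = AzureVMCluster(
    resource_group=resource_group,
    location=location,
    vnet=vnet,
    security_group=security_group,
    n_workers=initial_node_count,
    vm_size=vm_size,
    docker_image=base_dockerfile,
    docker_args="--privileged",
    auto_shutdown=False,
    security=False,
    # Upgrade dask/distributed inside the image so they match the client.
    env_vars={"EXTRA_PIP_PACKAGES": "--upgrade dask distributed dask-cloudprovider[azure]"},
    worker_class="dask_cuda.CUDAWorker",
)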
Thanks, I will try that then.
@jacobtomlinson It seemed to have worked, thanks
I am experiencing similar behavior again: OSError: Timed out during handshake while connecting to tcp://104.215.83.219:8786 after 10 s.
It seems to have a different underlying cause:
distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/protocol/core.py", line 107, in loads
small_payload = frames.pop()
IndexError: pop from empty list
Traceback (most recent call last):
File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
handshake = await asyncio.wait_for(comm.read(), time_left())
File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
return fut.result()
File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/comm/tcp.py", line 217, in read
msg = await from_frames(
File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/comm/utils.py", line 80, in from_frames
res = _from_frames()
File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/comm/utils.py", line 63, in _from_frames
return protocol.loads(
File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/protocol/core.py", line 107, in loads
small_payload = frames.pop()
IndexError: pop from empty list
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "
Print output:
>>> print(AzureVMCluster.get_cloud_init(
...     resource_group=resource_group,
...     location=location,
...     vnet=vnet,
...     security_group=security_group,
...     n_workers=n_workers,
...     vm_size=vm_size,
...     docker_image=base_dockerfile,
...     docker_args="--privileged",
...     auto_shutdown=False,
...     security=False,
...     env_vars={"EXTRA_PIP_PACKAGES": EXTRA_PIP_PACKAGES},
...     worker_class="dask_cuda.CUDAWorker"))
#cloud-config
# Bootstrap
packages:
- apt-transport-https
- ca-certificates
- curl
- gnupg-agent
- software-properties-common
- ubuntu-drivers-common
# Enable ipv4 forwarding, required on CIS hardened machines
write_files:
- path: /etc/sysctl.d/enabled_ipv4_forwarding.conf
  content: |
    net.ipv4.conf.all.forwarding=1
# create the docker group
groups:
- docker
# Add default auto created user to docker group
system_info:
  default_user:
    groups: [docker]
runcmd:
# Install Docker
- curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
- add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
- apt-get update -y
- apt-get install -y docker-ce docker-ce-cli containerd.io
- systemctl start docker
- systemctl enable docker
# Install NVIDIA driver
- DEBIAN_FRONTEND=noninteractive ubuntu-drivers install
# Install NVIDIA docker
- curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
- curl -s -L https://nvidia.github.io/nvidia-docker/$(. /etc/os-release;echo $ID$VERSION_ID)/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
- apt-get update -y
- apt-get install -y nvidia-docker2
- systemctl restart docker
# Run container
- 'docker run --net=host --gpus=all -e EXTRA_PIP_PACKAGES="--upgrade dask distributed dask-cloudprovider[azure]" --privileged rapidsai/rapidsai:0.18-cuda11.0-runtime-ubuntu18.04-py3.8 dask-scheduler --version'
Locally I have Python 3.8, dask 2021.03.0, distributed 2021.03.0, and dask-cloudprovider 2021.3.0, installed with pip.
Thanks @manuelreyesgomez.
Deserialization issues tend to be either a Python or Dask version mismatch between your local environment and the docker image.
In terms of cleaning up after failure, this is covered in #277.
@jacobtomlinson Thanks, that was it. It seems one needs tight control over package versions when doing pip installs on top of the Docker image. I will comment on the cleanup-after-failure issue.