
ECSCluster does not de-provision tasks after failing to connect to scheduler

Open kinghuang opened this issue 3 years ago • 11 comments

What happened:

If ECSCluster successfully runs a scheduler task but then fails to connect to the scheduler, an error is raised and the scheduler task is left active. The task has to be manually removed from the ECS cluster and deregistered.

In my case, this happened when I forgot to connect to a VPN network that would provide connectivity to the VPC. The task successfully ran, but no network connection to the scheduler was possible.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 729, in __init__
    super().__init__(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 281, in __init__
    self.sync(self._start)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 189, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 926, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 314, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 73, in _start
    comm = await self.scheduler_comm.live_comm()
  File "/usr/local/lib/python3.8/site-packages/distributed/core.py", line 746, in live_comm
    comm = await connect(
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
    raise IOError(
OSError: Timed out trying to connect to tcp://10.53.13.110:8786 after 10 s

What you expected to happen:

If ECSCluster fails to connect to the running scheduler, it should catch the error and clean up the scheduler task.
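For illustration, a minimal sketch of the kind of guard being requested, written against the cluster's _start hook; the _delete_scheduler_task helper is hypothetical and not part of dask-cloudprovider:

# Hypothetical sketch, not the actual dask-cloudprovider code.
# The idea: if connecting to the just-launched scheduler fails,
# tear down the scheduler task instead of leaving it running.
async def _start(self):
    try:
        await super()._start()  # connects to the scheduler comm
    except OSError:
        # Connection timed out: clean up the task we just created
        # before re-raising, so nothing is left running in ECS.
        await self._delete_scheduler_task()  # hypothetical helper
        raise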

Minimal Complete Verifiable Example:

Create an ECSCluster instance with tasks in an unreachable network.
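For example (the VPC and subnet values are placeholders; any subnet that the local machine cannot reach, such as a private subnet with no VPN connection, reproduces the issue):

from dask_cloudprovider.aws import ECSCluster

# Placeholder network values: the scheduler task launches successfully,
# but the client cannot connect to it, which triggers the timeout.
cluster = ECSCluster(
    fargate_scheduler=True,
    fargate_workers=True,
    vpc="vpc-xxxxxxxx",
    subnets=["subnet-xxxxxxxx"],
    n_workers=1,
)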

Anything else we need to know?:

None.

Environment:

  • Dask version: 2021.2.0
  • Python version: 3.8.6
  • Operating System: Debian (python:3.8 Docker image)
  • Install method (conda, pip, source): pip

kinghuang avatar Mar 05 '21 22:03 kinghuang

Same problem with AzureVMCluster:

>>> cluster = AzureVMCluster(
... resource_group=resource_group,
... location = location,
... vnet=vnet,
... security_group=security_group,
... n_workers=initial_node_count,
... vm_size=vm_size,
... docker_image=base_dockerfile,
... docker_args="--privileged",
... auto_shutdown=False,
... security=False,
... env_vars={"EXTRA_PIP_PACKAGES": EXTRA_PIP_PACKAGES},
... worker_class="dask_cuda.CUDAWorker")
Creating scheduler instance
Assigned public IP
Network interface ready
Creating VM
Created VM dask-4828a19b-scheduler
Waiting for scheduler to run at 13.66.2.14:8786
Scheduler is running
Traceback (most recent call last):
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\comm\core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\asyncio\tasks.py", line 494, in wait_for
    return fut.result()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\comm\tcp.py", line 215, in read
    frames = unpack_frames(frames)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\protocol\utils.py", line 70, in unpack_frames
    (n_frames,) = struct.unpack_from(fmt, b)
struct.error: unpack_from requires a buffer of at least 8 bytes for unpacking 8 bytes at offset 0 (actual buffer size is 2)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\dask_cloudprovider\azure\azurevm.py", line 496, in __init__
    super().__init__(**kwargs)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\dask_cloudprovider\generic\vmcluster.py", line 284, in __init__
    super().__init__(**kwargs, security=self.security)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\deploy\spec.py", line 281, in __init__
    self.sync(self._start)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\deploy\cluster.py", line 189, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\utils.py", line 351, in sync
    raise exc.with_traceback(tb)
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\utils.py", line 334, in f
    result[0] = yield future
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\tornado\gen.py", line 762, in run
    value = future.result()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\dask_cloudprovider\generic\vmcluster.py", line 324, in _start
    await super()._start()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\deploy\spec.py", line 314, in _start
    await super()._start()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\deploy\cluster.py", line 73, in _start
    comm = await self.scheduler_comm.live_comm()
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\core.py", line 746, in live_comm
    comm = await connect(
  File "C:\Users\mreyesgomez\Anaconda3\envs\AzureVMCluster_3_8_03\lib\site-packages\distributed\comm\core.py", line 324, in connect
    raise IOError(
OSError: Timed out during handshake while connecting to tcp://13.66.2.14:8786 after 10 s

manuelreyesgomez avatar Mar 12 '21 23:03 manuelreyesgomez

It happened with both dask-cloudprovider[azure]==2021.1.1 and dask-cloudprovider[azure]==2021.3.0

manuelreyesgomez avatar Mar 12 '21 23:03 manuelreyesgomez

As this seems to be a general problem with cluster managers based on SpecCluster, I've raised an upstream issue dask/distributed#4589.

jacobtomlinson avatar Mar 15 '21 10:03 jacobtomlinson

@manuelreyesgomez I've seen struct.error: unpack_from requires a buffer of at least 8 bytes a bunch recently, and it is usually due to mismatched Dask versions.

Please check your local dask/distributed versions match those in the container you are using.
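For example, one quick way to compare them is to print the versions locally and then run the same snippet inside the Docker image you pass as docker_image and compare the output:

# Print the local versions; run the same snippet inside the container
# (e.g. via "docker run <image> python -c ...") and compare.
import dask
import distributed

print("dask:", dask.__version__)
print("distributed:", distributed.__version__)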

jacobtomlinson avatar Mar 15 '21 16:03 jacobtomlinson

@jacobtomlinson By container, do you mean the docker_image used in the constructor?

From a new conda environment I pip install the latest version of dask-cloudprovider[azure] and then run:

resource_group = "NGC-AML-Quick-Launch" workspace_name = "NGC_AML_Quick_Launch_WS" vnet="NGC-AML-Quick-Launch-vnet" security_group="NGC-AML-Quick-Launch-nsg" initial_node_count = 2 vm_size = "Standard_NC12s_v3" location = "South Central US" base_dockerfile = "rapidsai/rapidsai-core-nightly:0.18-cuda10.2-runtime-ubuntu18.04-py3.8" EXTRA_PIP_PACKAGES = "dask-cloudprovider[azure]"

cluster = AzureVMCluster(
    resource_group=resource_group,
    location=location,
    vnet=vnet,
    security_group=security_group,
    n_workers=initial_node_count,
    vm_size=vm_size,
    docker_image=base_dockerfile,
    docker_args="--privileged",
    auto_shutdown=False,
    security=False,
    env_vars={"EXTRA_PIP_PACKAGES": EXTRA_PIP_PACKAGES},
    worker_class="dask_cuda.CUDAWorker")

I pass the "dask-cloudprovider[azure]" as an extra pip install with the hopes of syncing dask and distributed version between the local machine and the scheduler/workers

I do not understand where the mismatch might be.

manuelreyesgomez avatar Mar 15 '21 17:03 manuelreyesgomez

By container you mean the docker_image used in the constructor?

Yes.

I pass the "dask-cloudprovider[azure]" as an extra pip install with the hopes of syncing dask and distributed version between the local machine and the scheduler/workers

This is not enough on its own. It will install dask-cloudprovider, but as dask and distributed are already installed in the image it will just use those versions.

You could either check the package versions in the container and install them locally, or you could set EXTRA_PIP_PACKAGES to something like --upgrade dask distributed dask-cloudprovider[azure] to force an upgrade to the latest versions in the Docker image at runtime.
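For example, keeping the constructor call from above and changing only the EXTRA_PIP_PACKAGES value (a sketch; whether the upgrade resolves the mismatch depends on what the image ships):

# Force the container to upgrade dask/distributed at startup so they
# match the latest releases installed locally.
EXTRA_PIP_PACKAGES = "--upgrade dask distributed dask-cloudprovider[azure]"

cluster = AzureVMCluster(
    resource_group=resource_group,
    location=location,
    vnet=vnet,
    security_group=security_group,
    n_workers=initial_node_count,
    vm_size=vm_size,
    docker_image=base_dockerfile,
    docker_args="--privileged",
    auto_shutdown=False,
    security=False,
    env_vars={"EXTRA_PIP_PACKAGES": EXTRA_PIP_PACKAGES},
    worker_class="dask_cuda.CUDAWorker",
)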

jacobtomlinson avatar Mar 15 '21 17:03 jacobtomlinson

you could set EXTRA_PIP_PACKAGES to something like --upgrade dask distributed dask-cloudprovider[azure] to force an upgrade to the latest versions in the Docker image at runtime.

Thanks, I will try that then.

manuelreyesgomez avatar Mar 15 '21 17:03 manuelreyesgomez

@jacobtomlinson It seems to have worked, thanks.

manuelreyesgomez avatar Mar 15 '21 18:03 manuelreyesgomez

Experiencing similar behavior again: OSError: Timed out during handshake while connecting to tcp://104.215.83.219:8786 after 10 s

It seems to have a different underlying cause:

distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/protocol/core.py", line 107, in loads
    small_payload = frames.pop()
IndexError: pop from empty list
Traceback (most recent call last):
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
    return fut.result()
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/comm/tcp.py", line 217, in read
    msg = await from_frames(
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/comm/utils.py", line 80, in from_frames
    res = _from_frames()
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/comm/utils.py", line 63, in _from_frames
    return protocol.loads(
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/protocol/core.py", line 107, in loads
    small_payload = frames.pop()
IndexError: pop from empty list

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/dask_cloudprovider/azure/azurevm.py", line 496, in __init__
    super().__init__(**kwargs)
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/dask_cloudprovider/generic/vmcluster.py", line 284, in __init__
    super().__init__(**kwargs, security=self.security)
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/deploy/spec.py", line 281, in __init__
    self.sync(self._start)
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 189, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/utils.py", line 351, in sync
    raise exc.with_traceback(tb)
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/utils.py", line 334, in f
    result[0] = yield future
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/dask_cloudprovider/generic/vmcluster.py", line 324, in _start
    await super()._start()
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/deploy/spec.py", line 314, in _start
    await super()._start()
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 73, in _start
    comm = await self.scheduler_comm.live_comm()
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/core.py", line 746, in live_comm
    comm = await connect(
  File "/home/mreyesgomez/anaconda3/envs/AzureVMCluster_03_19/lib/python3.8/site-packages/distributed/comm/core.py", line 324, in connect
    raise IOError(
OSError: Timed out during handshake while connecting to tcp://104.215.83.219:8786 after 10 s

Print output:

>>> print(AzureVMCluster.get_cloud_init(
...     resource_group=resource_group,
...     location=location,
...     vnet=vnet,
...     security_group=security_group,
...     n_workers=n_workers,
...     vm_size=vm_size,
...     docker_image=base_dockerfile,
...     docker_args="--privileged",
...     auto_shutdown=False,
...     security=False,
...     env_vars={"EXTRA_PIP_PACKAGES": EXTRA_PIP_PACKAGES},
...     worker_class="dask_cuda.CUDAWorker"))
#cloud-config

# Bootstrap
packages:
  - apt-transport-https
  - ca-certificates
  - curl
  - gnupg-agent
  - software-properties-common
  - ubuntu-drivers-common

# Enable ipv4 forwarding, required on CIS hardened machines
write_files:
  - path: /etc/sysctl.d/enabled_ipv4_forwarding.conf
    content: |
      net.ipv4.conf.all.forwarding=1

# create the docker group
groups:
  - docker

# Add default auto created user to docker group
system_info:
  default_user:
    groups: [docker]

runcmd:
  # Install Docker
  - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
  - add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
  - apt-get update -y
  - apt-get install -y docker-ce docker-ce-cli containerd.io
  - systemctl start docker
  - systemctl enable docker

  # Install NVIDIA driver
  - DEBIAN_FRONTEND=noninteractive ubuntu-drivers install

  # Install NVIDIA docker
  - curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  - curl -s -L https://nvidia.github.io/nvidia-docker/$(. /etc/os-release;echo $ID$VERSION_ID)/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
  - apt-get update -y
  - apt-get install -y nvidia-docker2
  - systemctl restart docker

  # Run container
  - 'docker run --net=host --gpus=all -e EXTRA_PIP_PACKAGES="--upgrade dask distributed dask-cloudprovider[azure]" --privileged rapidsai/rapidsai:0.18-cuda11.0-runtime-ubuntu18.04-py3.8 dask-scheduler --version'

Local environment: Python 3.8, dask 2021.03.0, distributed 2021.03.0, dask-cloudprovider 2021.3.0, installed with pip.

manuelreyesgomez avatar Mar 31 '21 00:03 manuelreyesgomez

Thanks @manuelreyesgomez.

Deserialization issues tend to be either a Python or Dask version mismatch between your local environment and the docker image.
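Once a connection does succeed, distributed's built-in version check can confirm whether the client, scheduler and workers agree (a short sketch; get_versions(check=True) raises if it finds a mismatch):

from distributed import Client

client = Client(cluster)
# Raises an error if the client, scheduler and workers report
# mismatched package versions; otherwise returns the version info.
versions = client.get_versions(check=True)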

In terms of cleaning up after failure, this is covered in #277.

jacobtomlinson avatar Mar 31 '21 08:03 jacobtomlinson

@jacobtomlinson Thanks, that was it. It seems one needs tight control over package versions when doing pip installs on top of the docker image. I will comment on the cleanup after failure.

manuelreyesgomez avatar Mar 31 '21 19:03 manuelreyesgomez