dask-cloudprovider
Cannot spin up ECS GPU worker with current versions
Describe the issue: My GPU worker cannot start via the
"command": [
"dask-cuda-worker" if self._worker_gpu else "dask-worker",
"--nthreads",
"{}".format(
max(int(self._worker_cpu / 1024), 1)
if self._worker_nthreads is None
else self._worker_nthreads
),
"--memory-limit",
"{}MB".format(int(self._worker_mem)),
"--death-timeout",
"60",
]
that gets passed in from ecs.py. dask-cuda seems to have removed the --death-timeout option, so upon startup of the worker, I see:
Usage: dask-cuda-worker [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...
Try 'dask-cuda-worker --help' for help.
Error: Got unexpected extra argument: (60)
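One possible stopgap is to add the flag only when the installed worker CLI still advertises it. The following is a minimal sketch under assumptions, not what dask-cloudprovider does today: `cli_supports_flag` and `build_worker_command` are hypothetical helpers, the arguments mirror the `self._worker_*` attributes in the snippet above, and the probe assumes the CLI lists its options in `--help` output.

```python
import subprocess


def cli_supports_flag(executable, flag):
    """Hypothetical probe: True if `<executable> --help` mentions `flag`."""
    try:
        result = subprocess.run(
            [executable, "--help"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # The executable is not installed in this environment.
        return False
    return flag in result.stdout


def build_worker_command(worker_gpu, worker_cpu, worker_mem,
                         worker_nthreads=None, supports_death_timeout=True):
    """Hypothetical helper: build the worker command, adding --death-timeout
    only when the caller says the CLI accepts it."""
    if worker_nthreads is None:
        nthreads = max(int(worker_cpu / 1024), 1)
    else:
        nthreads = worker_nthreads
    command = [
        "dask-cuda-worker" if worker_gpu else "dask-worker",
        "--nthreads",
        str(nthreads),
        "--memory-limit",
        "{}MB".format(int(worker_mem)),
    ]
    if supports_death_timeout:
        command += ["--death-timeout", "60"]
    return command
```

The probe would have to run wherever the worker executable is actually installed, for example `cli_supports_flag("dask-cuda-worker", "--death-timeout")` inside the worker image, with the result passed in as `supports_death_timeout`. Dropping the flag also drops the timeout behaviour, so this only avoids the startup error.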
I'm unfortunately trying to run this from Prefect, so I can't pin dask-cuda and distributed to versions old enough to still have this argument. When I pin to old enough versions on the scheduler/worker container, the more recent distributed version on the Prefect agent container doesn't play nicely with the scheduler/worker, and the Prefect agent reports this error:
2022-10-13 21:25:50,708 - distributed.protocol.core - CRITICAL - Failed to deserialize
2022-10-13 14:25:50 Traceback (most recent call last):
2022-10-13 14:25:50 File "/usr/local/lib/python3.9/site-packages/distributed/protocol/core.py", line 158, in loads
2022-10-13 14:25:50 return msgpack.loads(
2022-10-13 14:25:50 File "msgpack/_unpacker.pyx", line 205, in msgpack._cmsgpack.unpackb
2022-10-13 14:25:50 ValueError: Unpack failed: incomplete input
2022-10-13 14:22:17 21:22:17.913 | INFO | prefect.task_runner.dask - Creating a new Dask cluster with `__prefect_loader__.<lambda>`
Minimal Complete Verifiable Example: run a docker image with the following, and you should see the error.
RUN pip install prefect distributed dask-cuda dask
CMD ["dask-cuda-worker", "--nthreads", "1", "--death-timeout", "60"]
Anything else we need to know?:
Environment:
- Dask version: dask 2022.9.2, dask-cuda 22.10.0, distributed 2022.9.2, prefect 2.6.0
- Python version: Python 3.8.0
- Operating System:
- Install method (conda, pip, source): pip
Looks like this was removed in https://github.com/rapidsai/dask-cuda/pull/563. cc @charlesbluca @pentschev
I'm surprised this hasn't come up until now.
I'm not familiar with dask-cloudprovider, is --death-timeout something generally important or is it something that it can live without?
I would say it is an important feature. One of dask-cloudprovider's goals is to fail cheaply, so if a worker cannot connect to a scheduler before a timeout expires, it should shut down/terminate to save money.
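To illustrate the fail-cheaply idea, here is a rough sketch of a pre-flight check that a container entrypoint could run before starting the worker. It is an assumption-laden stand-in, not an equivalent of --death-timeout: `wait_for_scheduler` is a hypothetical helper, and it only tests that a TCP connection to the scheduler address can be opened before a deadline.

```python
import socket
import time


def wait_for_scheduler(host, port, timeout=60.0, interval=1.0):
    """Hypothetical helper: return True once a TCP connection to the
    scheduler succeeds, or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening there.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out; pause briefly and retry.
            time.sleep(interval)
    return False
```

An entrypoint could exit with a non-zero status when this returns False, so that the container stops instead of idling. It does not cover a scheduler that disappears after the worker has started.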