
Cannot spin up ECS GPU worker with current versions

Open cdc97 opened this issue 2 years ago • 3 comments

Describe the issue: My GPU worker cannot start via the command

"command": [
    "dask-cuda-worker" if self._worker_gpu else "dask-worker",
    "--nthreads",
    "{}".format(
        max(int(self._worker_cpu / 1024), 1)
        if self._worker_nthreads is None
        else self._worker_nthreads
    ),
    "--memory-limit",
    "{}MB".format(int(self._worker_mem)),
    "--death-timeout",
    "60",
]

that gets passed in from ecs.py. Dask-cuda seems to have removed the --death-timeout option, so upon startup of the worker, I see

Usage: dask-cuda-worker [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...
Try 'dask-cuda-worker --help' for help.

Error: Got unexpected extra argument: (60)

I'm unfortunately running this from Prefect, so I can't pin dask-cuda and distributed to versions old enough to still accept this argument. When I do pin older versions in the scheduler/worker container, the newer distributed version on the Prefect agent container doesn't play nicely with the scheduler/worker, and the agent logs:

2022-10-13 21:25:50,708 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/protocol/core.py", line 158, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 205, in msgpack._cmsgpack.unpackb
ValueError: Unpack failed: incomplete input
21:22:17.913 | INFO | prefect.task_runner.dask - Creating a new Dask cluster with `__prefect_loader__.<lambda>`

Minimal Complete Verifiable Example: build and run a Docker image with the following, and you should see the error.

RUN pip install prefect distributed dask-cuda dask
CMD ["dask-cuda-worker", "--nthreads", "1", "--death-timeout", "60"]

Anything else we need to know?:
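One interim fix for the snippet above would be to pass --death-timeout only when the plain dask-worker CLI is used, since newer dask-cuda releases reject it. A minimal sketch of that conditional command construction (the standalone function name and parameters are illustrative, not dask-cloudprovider's actual API; in ecs.py this logic would live where the "command" list is built):

```python
def build_worker_command(worker_gpu, worker_cpu, worker_mem, worker_nthreads=None):
    """Build the ECS worker container command, mirroring ecs.py's logic
    but omitting --death-timeout for dask-cuda-worker, which no longer
    accepts that option in recent dask-cuda releases."""
    nthreads = (
        max(int(worker_cpu / 1024), 1) if worker_nthreads is None else worker_nthreads
    )
    command = [
        "dask-cuda-worker" if worker_gpu else "dask-worker",
        "--nthreads",
        str(nthreads),
        "--memory-limit",
        "{}MB".format(int(worker_mem)),
    ]
    if not worker_gpu:
        # Only the plain dask-worker CLI still accepts --death-timeout here.
        command += ["--death-timeout", "60"]
    return command
```

This trades away the fail-cheaply timeout on GPU workers, which, as discussed below, is a real loss rather than a clean solution.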

Environment:

dask                          2022.9.2
dask-cuda                     22.10.0
distributed                   2022.9.2
prefect                       2.6.0
  • Dask version:
  • Python version: Python 3.8.0
  • Operating System:
  • Install method (conda, pip, source): pip

cdc97 avatar Oct 13 '22 22:10 cdc97

Looks like this was removed in https://github.com/rapidsai/dask-cuda/pull/563. cc @charlesbluca @pentschev

I'm surprised this hasn't come up until now.

jacobtomlinson avatar Oct 14 '22 16:10 jacobtomlinson

I'm not familiar with dask-cloudprovider, is --death-timeout something generally important or is it something that it can live without?

pentschev avatar Oct 14 '22 18:10 pentschev

I would say it is an important feature. One of dask-cloudprovider's goals is to fail cheaply, so if a worker cannot connect to a scheduler within a timeout it should shut down/terminate to save money.

jacobtomlinson avatar Oct 18 '22 10:10 jacobtomlinson
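To illustrate the fail-cheaply behaviour described above, here is a rough sketch of the semantics --death-timeout provides (illustrative only, not distributed's actual implementation): keep retrying the scheduler connection, and give up so the caller can terminate the cloud instance/task if no connection is made within the timeout.

```python
import socket
import time


def run_with_death_timeout(scheduler_host, scheduler_port, death_timeout=60.0):
    """Return True once the scheduler is reachable, or False if it never
    becomes reachable within death_timeout seconds, at which point a
    cloud worker should terminate rather than sit idle accruing cost."""
    deadline = time.monotonic() + death_timeout
    while time.monotonic() < deadline:
        try:
            # A successful TCP connect stands in for the worker's
            # scheduler handshake in this sketch.
            with socket.create_connection((scheduler_host, scheduler_port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.1)  # scheduler not up yet; retry until the deadline
    return False  # caller should shut down/terminate the instance
```

Losing this on dask-cuda-worker means an ECS GPU task that never reaches the scheduler keeps running (and billing) until something external kills it.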