
Workers killed by signal 6/9 and timeout errors during cluster shutdown with `LocalCUDACluster`


I'm using dask-cuda's LocalCUDACluster for GPU-based distributed computing in a Python script. While the computation completes successfully, I encounter multiple errors during the shutdown phase.

Specifically, after calling `cluster.close()` to shut down the Dask cluster gracefully, I see repeated log messages like:

distributed.nanny - INFO - Worker process XXX was killed by signal 6
...
distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
...
distributed.nanny - INFO - Worker process XXX was killed by signal 9

Additionally, I get a traceback indicating a TimeoutError during internal cluster state correction:

tornado.application - ERROR - Exception in callback ...
TimeoutError

And finally, a memory-related error from tcmalloc:

src/tcmalloc.cc:284] Attempt to free invalid pointer 0x...

Environment Setup:

  • Using LocalCUDACluster with explicit GPU device configuration.
  • Disabled Dask graph fusion (`optimization.fuse.active=False`) and set conservative memory thresholds.
  • Workers are configured with device_memory_limit="80GB" and threads_per_worker=1.
  • Client and cluster are manually closed at the end of execution.

Code Snippet:

dask.config.set({"optimization.fuse.active": False})
dask.config.set({
    "distributed.worker.memory.target": 0.6,
    "distributed.worker.memory.spill": 0.7,
    "distributed.worker.memory.pause": 0.8,
    "distributed.worker.memory.terminate": 0.9,
    "distributed.comm.timeouts.connect": "300s",
    "distributed.comm.timeouts.tcp": "300s",
    "distributed.worker.daemon": False,
    "distributed.nanny.timeout": "60s"
})

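# cuda_devices and n_workers are defined earlier in the script (not shown here)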
cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=cuda_devices,
    device_memory_limit="80GB",
    n_workers=n_workers,
    threads_per_worker=1,
    dashboard_address=':0',
    jit_unspill=False,
    silence_logs=False
)

client = Client(cluster, timeout='60s')
client.wait_for_workers(n_workers, timeout=120)

# ... computation ...

cluster.close(timeout=300)
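For completeness, a sketch of the end-of-script teardown (the try/finally wrapper is illustrative of how the script is structured; per the setup notes above, both the client and the cluster are closed explicitly):

try:
    # ... computation ...
    pass
finally:
    # Close the client first so no further work is submitted,
    # then ask the cluster to shut its workers down.
    client.close()
    cluster.close(timeout=300)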

Expected Behavior:

Graceful shutdown of workers and scheduler without force-killing or timeout errors.

Actual Behavior:

Workers are terminated forcefully with signals 6 and 9, followed by timeout and memory-related errors during shutdown.

Environment:

  • Dask version: 2024.12.1
  • Dask-CUDA version: 25.2.0
  • Python version: 3.12
  • OS: Linux (assumed)
  • Relevant packages: cudf, cupy, torch, distributed, etc.

Question:

Is this expected behavior? Are there additional configurations or best practices to ensure clean shutdown of GPU clusters in Dask?
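For example, would managing the cluster and client as context managers be the recommended pattern? A rough sketch of what I have in mind (arguments trimmed for brevity):

with LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=cuda_devices,
    device_memory_limit="80GB",
    n_workers=n_workers,
    threads_per_worker=1,
) as cluster:
    with Client(cluster) as client:
        client.wait_for_workers(n_workers, timeout=120)
        # ... computation ...
# Exiting the blocks closes the client and then the cluster automatically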

Any help or guidance would be greatly appreciated!

leekaimao · Aug 15 '25 01:08