`dask worker --nworkers -1` does not use all CPUs available
Describe the issue:
The documentation says that if the argument passed to --nworkers is negative, then (CPU_COUNT + 1 + nworkers) is used for the number of processes. I have 2 machines in the cluster, both with the same specs (nproc = 256); however, I do not get 256 workers.
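For reference, this is the arithmetic I expect from the documented formula (a quick sketch; the 256 is just the nproc value from these machines):
# Expected worker count per the documented formula, assuming CPU_COUNT resolves to 256
CPU_COUNT = 256
nworkers = -1
expected = CPU_COUNT + 1 + nworkers  # 256 + 1 - 1 = 256
print(expected)  # I expect 256 worker processes per host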
Full CPU specs:
> lscpu | egrep 'Model name|Socket|Core|Thread|NUMA|CPU\(s\)'
CPU(s): 256
On-line CPU(s) list: 0-255
Model name: AMD EPYC 7713 64-Core Processor
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
The number I actually get seems to fluctuate each time I run the dask worker
command. One server normally starts with around 210 workers, and the other with around 70 workers, but this changes. The UI therefore reports about 280 workers in total with 1 thread each.
Firstly, why the variability? And secondly, how can I maximise this count? The workloads I need to run are simple, single-process, medium-length tasks.
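In case it helps, this is the check I can run on each host to compare the CPU counts that Python and Dask report (assuming dask.system.CPU_COUNT is the value the worker CLI consults; I have not verified that against the source):
import os
from dask.system import CPU_COUNT  # assumption: the value the dask worker CLI uses

print("os.cpu_count():        ", os.cpu_count())
print("sched_getaffinity(0):  ", len(os.sched_getaffinity(0)))  # CPUs this process is allowed to run on
print("dask.system.CPU_COUNT: ", CPU_COUNT)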
Minimal Complete Verifiable Example:
# Scheduler
dask scheduler --host X.X.X.7
# Workers
dask worker tcp://X.X.X.7:8786 --nworkers -1
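To double-check the count reported by the UI, I also count the connected workers from a client (a sketch; scheduler_info() is the call I believe exposes the worker map):
# Count connected workers from a client session
from distributed import Client

client = Client("tcp://X.X.X.7:8786")
workers = client.scheduler_info()["workers"]
print(len(workers), "workers connected")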
Anything else we need to know?: Things I've ruled out:
- setting --nworkers explicitly to 256 behaves the same as -1
- also specifying --nthreads makes no change
- halving the nworkers to 128 works - I get 128 workers with 2 threads each. Is Dask hitting an upper limit here? If so, why is it variable between 70-210?
- nothing else is consuming significant resources on the servers
Environment:
- Dask version: 2024.6.2
- Python version: 3.10.12
- Operating System: Ubuntu 22.04.4 LTS
- Install method (conda, pip, source): pip