distributed
Wrong report of currently executed tasks when using --nworkers N, with N>1
Describe the issue:
Executing a Dask worker with more than one process (i.e. --nworkers N with N>1) leads to wrong data in both the worker dashboard and, especially, the Prometheus metrics at /metrics.
Minimal Complete Verifiable Example:
Running a dask worker in k8s with this config
spec:
  containers:
  - args:
    - dask
    - worker
    - --name
    - $(DASK_WORKER_NAME)
    - --dashboard
    - --dashboard-address
    - "8788"
    - --nworkers
    - "15"
    - --nthreads
    - "1"
and fully loading all worker processes (i.e. 15 running tasks per worker) will lead to these worker dashboard numbers.
So the dashboard reports only 1 executing task while the worker is actually churning through 15 tasks. The Prometheus metrics report the same undercount:
# HELP dask_worker_tasks Number of tasks at worker.
# TYPE dask_worker_tasks gauge
dask_worker_tasks{state="memory"} 3.0
dask_worker_tasks{state="executing"} 1.0
# HELP dask_worker_concurrent_fetch_requests Deprecated: This metric has been renamed to transfer_incoming_count.\nNumber of open fetch requests to other workers
# TYPE dask_worker_concurrent_fetch_requests gauge
dask_worker_concurrent_fetch_requests 0.0
# HELP dask_worker_threads Number of worker threads
# TYPE dask_worker_threads gauge
dask_worker_threads 1.0
Aggregating these metrics with Grafana gives wildly wrong numbers (off by a factor of 15 in my case).
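To make the factor of 15 concrete, here is a minimal sketch of the arithmetic. It assumes (this is not confirmed anywhere in this report) that the scraped /metrics payload describes only one of the 15 worker processes, and that each process would report a similar payload if it were scraped separately. The helpers `read_gauge` and `total_executing` are hypothetical names written for this illustration; they are not part of Dask or of any Prometheus client library, and the parser only handles the simple `name{labels} value` lines shown above.

```python
import re

# Excerpt of the payload from this report (one scrape of /metrics).
SAMPLE = """\
# HELP dask_worker_tasks Number of tasks at worker.
# TYPE dask_worker_tasks gauge
dask_worker_tasks{state="memory"} 3.0
dask_worker_tasks{state="executing"} 1.0
# HELP dask_worker_threads Number of worker threads
# TYPE dask_worker_threads gauge
dask_worker_threads 1.0
"""

# Matches "name value" or "name{label="x",...} value" (no timestamps).
_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def read_gauge(text, name, **labels):
    """Return the value of one sample, or None if it is not present."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _LINE.match(line)
        if m is None or m.group("name") != name:
            continue
        found = dict(_LABEL.findall(m.group("labels") or ""))
        if found == labels:
            return float(m.group("value"))
    return None


def total_executing(payloads):
    """Sum the 'executing' gauge over one payload per worker process."""
    return sum(
        read_gauge(p, "dask_worker_tasks", state="executing") or 0.0
        for p in payloads
    )


# What a single scrape shows vs. what 15 busy processes would add up to
# (assumption: every process reports the same shape of payload).
single = read_gauge(SAMPLE, "dask_worker_tasks", state="executing")
combined = total_executing([SAMPLE] * 15)
print(single, combined)
```

Under those assumptions, a dashboard built on the single scrape sees 1 executing task, while the sum over all 15 processes would be 15, which matches the discrepancy described above.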
Environment:
- Dask version: 2024.2.1
- Python version: 3.9.18
- Operating System: Linux
- Install method (conda, pip, source): poetry