
Wrong reporting of currently executing tasks when using --nworkers N with N>1


Describe the issue:

Running a Dask worker with more than one process (i.e. --nworkers N with N>1) leads to wrong data in both the worker dashboard and, especially, in the Prometheus metrics at /metrics.

Minimal Complete Verifiable Example:

Running a Dask worker in k8s with this config:

spec:
  containers:
  - args:
    - dask
    - worker
    - --name
    - $(DASK_WORKER_NAME)
    - --dashboard
    - --dashboard-address
    - "8788"
    - --nworkers
    - "15"
    - --nthreads
    - "1"

Fully loading all worker processes (i.e. 15 running tasks on the worker, one per single-threaded process) leads to these worker dashboard numbers:

[worker dashboard screenshot]

So the dashboard reports that only 1 task is executing, while the worker is actually churning through 15 tasks. Likewise, the Prometheus metrics say:

# HELP dask_worker_tasks Number of tasks at worker.
# TYPE dask_worker_tasks gauge
dask_worker_tasks{state="memory"} 3.0
dask_worker_tasks{state="executing"} 1.0
# HELP dask_worker_concurrent_fetch_requests Deprecated: This metric has been renamed to transfer_incoming_count.\nNumber of open fetch requests to other workers
# TYPE dask_worker_concurrent_fetch_requests gauge
dask_worker_concurrent_fetch_requests 0.0
# HELP dask_worker_threads Number of worker threads
# TYPE dask_worker_threads gauge
dask_worker_threads 1.0

Aggregating these metrics with Grafana gives wildly wrong numbers (out by a factor of 15 in my case).
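
The discrepancy can also be checked directly from a client, as in the sketch below; the scheduler and worker addresses are placeholders. In this setup the scheduler sees one busy task per worker process (15 in total), while the /metrics endpoint scraped on the dashboard port reports only a single executing task, as shown above.

import urllib.request

from dask.distributed import Client

client = Client("tcp://my-scheduler:8786")  # hypothetical scheduler address

# Tasks currently processing, per worker process, as seen by the scheduler.
processing = client.processing()
print({addr: len(tasks) for addr, tasks in processing.items()})

# The "executing" gauge exposed on the worker pod's dashboard port (8788 above).
metrics = urllib.request.urlopen("http://my-worker:8788/metrics").read().decode()
for line in metrics.splitlines():
    if line.startswith("dask_worker_tasks") and 'state="executing"' in line:
        print(line)  # reports 1.0 even though 15 tasks are running in the pod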

Environment:

  • Dask version: 2024.2.1
  • Python version: 3.9.18
  • Operating System: Linux
  • Install method (conda, pip, source): poetry

jonded94 · Feb 29 '24 15:02