
workload not balancing during scale up on dask-gateway

chrisroat opened this issue 3 years ago • 6 comments

What happened:

I have a fully parallel workload of 30 tasks (no dependencies), each of which takes ~20 minutes. My cluster autoscales between 10 and 50 workers. When I start the task graph, the 30 tasks land 3 apiece on the 10 initial workers. Eventually the cluster scales to 30 workers, but sometimes the tasks are not redistributed.

Even after some jobs have finished, so their timing is known, I have seen 7 remaining jobs distributed 2-2-2-1 across 4 workers while plenty of workers sit empty.

What you expected to happen:

As new workers come online, they steal tasks from existing workers.
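
The redistribution here relies on the scheduler's work stealing. As a sanity check, a minimal sketch (using the standard distributed config key, nothing specific to this issue) for confirming stealing has not been disabled:

```python
import dask
import distributed  # noqa: F401 -- importing registers the distributed config defaults

# Work stealing is what should move queued tasks from busy workers to newly
# added ones; it defaults to True but can be disabled via configuration.
print(dask.config.get("distributed.scheduler.work-stealing"))
```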

Minimal Complete Verifiable Example:

I tried to replicate this on a local cluster, but could not. There, things work as expected and tasks are immediately redistributed.
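
For reference, a minimal sketch of the kind of local reproduction attempted (task counts are from the report above; durations and other details are illustrative stand-ins for the real ~20 minute dask-gateway workload):

```python
import time

from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Start small, submit 30 independent long tasks, then scale up and watch
    # whether the new workers steal queued tasks from the busy ones.
    cluster = LocalCluster(n_workers=10, threads_per_worker=1)
    client = Client(cluster)

    def long_task(i):
        time.sleep(60)  # stand-in for the ~20 minute computation
        return i

    futures = client.map(long_task, range(30))
    cluster.scale(30)  # locally, tasks redistribute to the new workers as expected
    print(client.gather(futures))
```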

Anything else we need to know?:

I'm attaching debug scheduler/worker logs, collected with @fjetter's script, from the point where 30 workers are available and 10 workers each have 3 tasks.

Environment:

  • Dask version: 2021.12.0
  • Python version: 3.8.12
  • Operating System: Ubuntu 20
  • Install method (conda, pip, source): conda

scheduler_20211214.pkl.gz

worker_20211214.pkl.gz

chrisroat commented Dec 14 '21 23:12

My workers and tasks do have resource constraints: each worker has 6 CPUs and 45 GB of memory, and this particular task requests 45 GB. Would the recent fix 96d4fd43682c379f9b39b6dc55f568872b766a47 potentially solve this problem?
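
For context, a sketch of how such constraints are typically declared with distributed's custom worker resources; the MEMORY resource name, addresses, and values are illustrative, not the actual deployment config:

```python
from dask.distributed import Client

# Workers advertise a custom resource, e.g. each started with:
#   dask-worker <scheduler-address> --nthreads 6 --resources "MEMORY=45e9"
# A task requesting the full 45e9 can then only run one-at-a-time per worker.
client = Client("tcp://scheduler:8786")  # placeholder address

def heavy(i):
    return i  # stand-in for the real 45 GB computation

futures = client.map(heavy, range(30), resources={"MEMORY": 45e9})
```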

chrisroat commented Dec 14 '21 23:12

I pushed a Docker image with 96d4fd43682c379f9b39b6dc55f568872b766a47 to our cluster, and it didn't seem to make a difference.

chrisroat commented Dec 15 '21 01:12

@chrisroat this sounds very similar to https://github.com/dask/distributed/issues/5564 to me, which was recently fixed but not released yet. Can you try on main or with https://github.com/dask/distributed/pull/5572?

gjoseph92 commented Dec 16 '21 21:12

Agreed with @gjoseph92's point that this seems similar to https://github.com/dask/distributed/issues/5564. FWIW https://github.com/dask/distributed/pull/5572 was included in the distributed==2021.12.0 release, so you should be able to test things out by updating to that release too.

jrbourbeau commented Dec 16 '21 22:12

I reported this at 2021.12.0, and later again at 96d4fd4; in both cases I see this behavior. I've attached the scheduler/worker dumps in case they help in understanding what is happening.

chrisroat commented Dec 16 '21 22:12

Could the needs_info label be removed? I reported this at the requested release. I am starting to try to understand these large cluster dump files and would appreciate any tips.
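
For anyone else digging in, a minimal sketch of loading such a dump for inspection, assuming it is a gzip-compressed msgpack file with top-level scheduler and worker entries as produced by Client.dump_cluster_state (the filename is a placeholder, and the exact keys may vary by version):

```python
import gzip

import msgpack

# Load the gzipped msgpack dump and peek at the scheduler's view of the tasks.
with gzip.open("dump_20220122.msgpack.gz", "rb") as f:
    state = msgpack.unpack(f, strict_map_key=False)

print(list(state))  # typically: scheduler, workers, versions
tasks = state["scheduler"]["tasks"]  # per-task scheduler state, keyed by task name
print(len(tasks), "tasks in scheduler state")
```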

I am running 2022.1.0 now, and I'm attaching more cluster dump logs. In this case, the workload doesn't fully balance even without autoscaling:

  • It's a 3D map_overlap, so it has lots of short upstream rearranging tasks.
  • stack-trim-getitem are the 680 fused tasks containing the long per-chunk calculation.
  • I set the config default task duration for stack-trim-getitem to 500s, which in my local scheduler tests helps mitigate this issue (see the sketch after this list).
  • The cluster has 50+ workers with 15 threads each, i.e. 750+ threads.
  • Not all 680 tasks are even ready; some are stuck behind unfinished upstream rearranging tasks.
  • Only 12-13 threads are being used.
  • One task has lots of the "pre-overlap" stitching tasks.
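
A minimal sketch of the default-duration override mentioned in the list above, using the standard distributed.scheduler.default-task-durations config key with the task prefix from this workload:

```python
import dask
import distributed  # noqa: F401 -- importing registers the distributed config schema

# Tell the scheduler to assume unmeasured "stack-trim-getitem" tasks take 500s,
# so they are treated as long-running when deciding where work should go.
# This has to be in effect on the scheduler (e.g. via its dask config/YAML),
# not just on the client, before the scheduler starts.
dask.config.set(
    {"distributed.scheduler.default-task-durations": {"stack-trim-getitem": "500s"}}
)
```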

Even a patch/hack that I could apply in a personal fork would be acceptable.

dump_765b081abd78484495e033a613140c02_20220122.msgpack.gz

chrisroat commented Jan 22 '22 16:01