Workload not balancing during scale-up on dask-gateway
What happened:
I have a fully parallel workload of 30 tasks (no dependencies), each of which takes ~20 minutes. My cluster autoscales between 10 and 50 workers. When I start the task graph, the 30 tasks are distributed three apiece across the 10 initial workers. Eventually the cluster scales to 30 workers, but sometimes the tasks are not redistributed.
Even after some tasks finish, when their timing should be known to the scheduler, I have seen the 7 remaining tasks distributed 2-2-2-1 across 4 workers while plenty of workers sat empty.
What you expected to happen:
As new workers come online, they steal tasks from existing workers.
Minimal Complete Verifiable Example:
I tried to replicate this on a local cluster but could not: things work as expected there, and tasks are immediately redistributed.
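For reference, the local reproduction attempt looked roughly like this (a sketch; short sleeps stand in for the ~20-minute tasks, and the worker counts are scaled down):

```python
import time

from dask.distributed import Client, LocalCluster


def work(i):
    # Stand-in for the real ~20-minute, fully independent task.
    time.sleep(5)
    return i


cluster = LocalCluster(n_workers=3, threads_per_worker=1)
client = Client(cluster)

# Submit 30 independent tasks onto the small cluster...
futures = client.map(work, range(30))

# ...then scale up and watch whether queued tasks get stolen by the new workers.
cluster.scale(10)
print(client.gather(futures))
```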
Anything else we need to know?:
I'm attaching debug scheduler/worker logs, collected with @fjetter's script, at the point where 30 workers are available and 10 workers each hold 3 tasks.
Environment:
- Dask version: 2021.12.0
- Python version: 3.8.12
- Operating System: Ubuntu 20
- Install method (conda, pip, source): conda
My workers and tasks do have resource constraints: each worker has 6 CPUs and 45 GB of memory, and this particular task requests the full 45 GB. Would the recent fix 96d4fd43682c379f9b39b6dc55f568872b766a47 potentially solve this problem?
I pushed a docker image to our cluster with 96d4fd43682c379f9b39b6dc55f568872b766a47 and it didn't seem to make a difference.
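For context, the resource constraints are declared roughly like this (a sketch: LocalCluster stands in for the real dask-gateway deployment, and the resource name MEMORY and big_task are illustrative):

```python
from dask.distributed import Client, LocalCluster

# Workers advertise an abstract resource; on the real cluster this is done with
# something like `dask-worker ... --resources "MEMORY=45e9"` (45 GB per worker).
cluster = LocalCluster(n_workers=2, threads_per_worker=6, resources={"MEMORY": 45e9})
client = Client(cluster)


def big_task(i):
    return i  # stand-in for the real long-running computation


# Each task requests the full 45 GB, so at most one of them can run per worker
# even though each worker has 6 threads.
futures = client.map(big_task, range(30), resources={"MEMORY": 45e9}, pure=False)
print(client.gather(futures))
```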
@chrisroat this sounds very similar to https://github.com/dask/distributed/issues/5564 to me, which was recently fixed but not released yet. Can you try on main or with https://github.com/dask/distributed/pull/5572?
Agreed with @gjoseph92's point that this seems similar to https://github.com/dask/distributed/issues/5564. FWIW https://github.com/dask/distributed/pull/5572 was included in the distributed==2021.12.0 release, so you should be able to test things out by updating to that release too
I originally reported this at 2021.12.0, and later also tested at 96d4fd4; I see this behavior in both cases. I've attached the scheduler/worker dumps in case they help in understanding what is happening.
Could the needs_info label be removed? I have reported this at the requested release. I am starting to try to understand these large cluster dump files and would appreciate any tips.
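For anyone else poking at these dumps, this is roughly how I'm loading one for inspection (a sketch that assumes the gzipped-msgpack output of Client.dump_cluster_state; the filename and key names are illustrative):

```python
import gzip

import msgpack

# The dump itself was written with something like:
#   client.dump_cluster_state("dump")   # -> "dump.msgpack.gz" with the default format

# Load it back as plain nested dicts for inspection.
with gzip.open("dump.msgpack.gz", "rb") as f:
    state = msgpack.unpack(f)

print(state.keys())               # top-level keys, typically scheduler and worker state
print(state["scheduler"].keys())  # task/worker state as nested dicts
```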
I am running at 2022.1.0 now, and I'm attaching more cluster dump logs. In this case, the workload doesn't fully balance, even without autoscaling.
- it's a 3D map_overlap, so there are lots of short upstream re-arranging tasks
- `stack-trim-getitem` are the 680 fused tasks containing the long per-chunk calculation
- I set the config default duration for `stack-trim-getitem` to 500s (see the sketch after this list), which in my local scheduler tests helps mitigate this issue
- the cluster has 50+ workers × 15 threads = 750+ threads
- Not all 680 tasks are even ready -- some are stuck behind unfinished upstream re-arranging tasks
- Only 12-13 threads are being used
- 1 worker has lots of the "pre-overlap" stitching tasks
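The config tweak mentioned in the list above looks roughly like this (a sketch; it has to be in effect on the scheduler before the graph is submitted, and the exact task prefix should match whatever the dashboard/task stream shows):

```python
import dask

# Tell the scheduler to assume ~500s for this task prefix instead of its default
# guess, so occupancy accounting and work stealing treat these as long tasks.
dask.config.set(
    {"distributed.scheduler.default-task-durations": {"stack-trim-getitem": "500s"}}
)
```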
Even a patch/hack that I could apply in a personal fork would be acceptable.