
Manually scale up adaptive clusters

jsignell opened this issue 5 years ago · 4 comments

There was some discussion on gitter about adaptive clusters and whether the user should be able to request a target number of workers.

Context

The rationale for wanting this kind of behavior is a scenario where you know you'll need a lot of workers at the beginning of a task, so you want to manually scale up, but then let the workers be killed off as they finish.

From @d-v-b: This describes most reductions (e.g., min of a big array), and I found that the adaptive cluster performs terribly for these workloads.

From @lesteve:

As I mentioned during my Dask-Jobqueue talk at the Dask workshop, my feeling is that a lot of use cases for Adaptive in an HPC context are embarrassingly parallel and have essentially two phases:

  1. scale up as fast as your HPC cluster lets you
  2. once the computation nears its end, Dask workers that have no tasks left should be killed quickly to free up resources.

I was planning to try doing that manually at one point but never got around to it. If someone makes some progress in this direction, let me know!

Also, from user feedback (not first-hand experience), it feels like Adaptive has some problems, and @d-v-b confirmed my impression at the workshop.
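
One way to lean the existing adaptive machinery toward these two phases is to tune the knobs that `cluster.adapt()` forwards to `Adaptive`. This is only a sketch: the cluster class and all parameter values below are illustrative placeholders, and the exact behavior depends on the distributed version in use.

```python
# A minimal sketch of biasing Adaptive toward the two phases above:
# scale up aggressively, then retire idle workers quickly.
# Cluster class and parameter values are placeholders, not recommendations.
from dask_jobqueue import SLURMCluster  # example HPC cluster class

cluster = SLURMCluster(cores=4, memory="8GB")
cluster.adapt(
    minimum=0,             # allow scaling all the way back down
    maximum=100,           # scale up as far as the HPC queue allows
    interval="1s",         # re-evaluate the worker target frequently
    wait_count=1,          # retire a worker after a single idle check
    target_duration="1s",  # request enough workers to finish queued work quickly
)
```
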

Proposal

I was originally thinking about this from the scheduler's perspective (interacted with via the client object). I was imagining that you could set a target on the scheduler, and that would convey the intention to the cluster (similar to the adaptive_target pattern). But @martindurant pointed out that you'd need some notion of how long the workers should hang around before adaptive starts trying to tear them down. Perhaps the right approach is that you don't start spinning up workers until some work is submitted (similar to Adaptive), but you wait to start the work until target workers (or, more likely, min(maximum, target)) are available.
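
For reference, something close to this flow can already be approximated with existing public APIs, though it is not the scheduler-side `target` attribute proposed above. The sketch below uses a LocalCluster and an illustrative `n_target`; the same pattern should apply to other cluster classes.

```python
# Rough approximation of the proposal with existing APIs:
# scale up manually, hold work back until enough workers exist,
# then let adaptive tear workers down as they go idle.
from dask.distributed import Client, LocalCluster

n_target = 8  # illustrative target worker count

cluster = LocalCluster(n_workers=0)  # start empty, like an adaptive cluster
client = Client(cluster)

# Phase 1: scale up eagerly and wait until the target number of
# workers has actually connected before assigning any tasks.
cluster.scale(n_target)
client.wait_for_workers(n_workers=n_target)

# Phase 2: submit the work, then hand control to adaptive so workers
# are retired as they run out of tasks.
futures = client.map(lambda x: x ** 2, range(1000))
cluster.adapt(minimum=0, maximum=n_target)

results = client.gather(futures)
```
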

ref #3526

jsignell avatar Mar 20 '20 18:03 jsignell

It sounds like there was some good conversation. Thanks @jsignell for writing this up.

However, it also looks like that conversation has stopped. Should we close this issue, or is this still something that people are planning to discuss and possibly work on?

mrocklin avatar Apr 05 '20 17:04 mrocklin

I have no plans to work on it.

jsignell avatar Apr 06 '20 13:04 jsignell

I am afraid that's the same for me ...

May I suggest that, if someone has some spare bandwidth to look at this (I know that may not be very likely), a more well-defined problem to tackle is #3526.

#3526 highlights a clear bug in Adaptive that happens in real life and prevents people from using Adaptive. #3526 also contains a stand-alone snippet that reproduces the problem with a LocalCluster (thanks @d-v-b!), and I was able to reproduce it.

lesteve avatar Apr 06 '20 13:04 lesteve

To add more anecdotal evidence, I also run into problems when I dump a large graph onto an adaptive (k8s dask-gateway) cluster sitting at zero workers. As k8s scales up slowly, the first worker gets a large number of tasks assigned. For whatever reason, many of those tasks do not get stolen as more workers come online.

I'm literally staring at a cluster with 15 workers right now, and one worker has ~300 tasks while the rest have fewer than 10 each. It just sits there, stuck, for minutes on end. I then figure out which pod is the hog and kill it. That unlocks the cluster, and it finishes rapidly.

This has been especially painful with graphs where the bulk of the work is in O(100) 45-minute tasks that are embarrassingly parallel. The task distribution ends up very poor: only ~25% of these tasks start running, and the rest get backed up. For now, I find the 2-3 pods causing the bottleneck and kill them. At that point, the distribution evens out and each worker gets one of the long-lived tasks.

Anecdotally, it seems like the scale-up always targets just a single pod initially, and more workers are only added after that first worker starts processing. I tried looking through the code but could not find logic to back up this observation.

I'm not sure whether what I want is this "fast scale-up" (or, at least, a "don't assign work until N workers are running" option), or a way to tweak the task stealing.
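
For context, both halves of that trade-off have existing knobs, sketched below. The config key names are taken from distributed's default distributed.yaml and may differ between versions; for a remote scheduler (e.g. dask-gateway) they must be applied where the scheduler process runs, so treat this as a sketch to verify locally rather than a fix.

```python
# Two existing knobs related to the behaviors above; key names and
# values are illustrative and should be checked against your
# installed distributed version.
import dask
from dask.distributed import Client

# Option 1: tweak work stealing (set before the scheduler starts).
dask.config.set({
    "distributed.scheduler.work-stealing": True,
    "distributed.scheduler.work-stealing-interval": "100ms",
})

client = Client()  # or connect to a dask-gateway / Kubernetes cluster

# Option 2: refuse to submit the graph until N workers exist, which
# avoids piling hundreds of tasks onto the first worker that appears.
client.wait_for_workers(n_workers=15)
```
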

chrisroat avatar Dec 06 '20 03:12 chrisroat