Improve work stealing for scaling situations
These are a few preliminary fixes which would close #4471. While this does not necessarily account for opportunity cost, as I described in https://github.com/dask/distributed/issues/4471#issuecomment-861483765, it still remedies the situation by:
- Allowing idle workers to be part of the decision when scheduling a task. The worker objective will take care of the rest and should assign it to the proper worker if it is not overloaded. That might backfire and should be tested in another context.
- The work stealing ratio calculation emits sentinel values in situations which are arguably not situations we should skip. As I argued in https://github.com/dask/distributed/issues/4471#issuecomment-861483765, there are situations where we should consider stealing even if the task is incredibly cheap to compute, since it might allow for further parallelism, in particular if the cluster is partially idling.
- This is minor tuning and not connected to the problem, but I chose to pick the thief not based on round robin but on the worker objective, as we do for the initial decision (a minimal sketch follows below). We might need to tweak this since it doesn't account for in-flight occupancy and might cause another overload of a worker.
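A minimal, self-contained sketch of that last point. The names here (`Worker`, `objective`, `pick_thief`) are illustrative stand-ins, not the actual `distributed` scheduler code:

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    # Illustrative stand-in for distributed's WorkerState
    occupancy: float = 0.0                      # seconds of queued work
    has_data: set = field(default_factory=set)  # dependency keys held here

def objective(ws, task_deps, dep_nbytes, bandwidth=100e6):
    # Estimated cost of running the task on ws: its current backlog plus
    # the time to transfer whichever dependencies it is missing.
    missing = task_deps - ws.has_data
    transfer_time = sum(dep_nbytes[d] for d in missing) / bandwidth
    return ws.occupancy + transfer_time

def pick_thief(idle_workers, task_deps, dep_nbytes):
    # Round robin would cycle through idle workers regardless of fit;
    # instead, pick the worker that minimizes the objective for this task.
    return min(idle_workers, key=lambda ws: objective(ws, task_deps, dep_nbytes))
```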
Tests missing (TODO), feedback welcome
- [x] Closes #4471
- [ ] Tests added / passed
- [x] Passes `black distributed` / `flake8 distributed` / `isort distributed`
I completely forgot: performance reports.
Before: https://gistcdn.githack.com/fjetter/6e04f5ced602378977c8fea0037e2c31/raw/b8dbbd5ae962ede8c0853926649619714ae1ded3/gh4471.html
After: https://gistcdn.githack.com/fjetter/6e04f5ced602378977c8fea0037e2c31/raw/b8dbbd5ae962ede8c0853926649619714ae1ded3/gh4471_combined.html
FYI: tests break because the stealing now behaves differently and the tests reflect that (good!). Just wanted to get this out early in case somebody has input already.
Summarizing things a bit here, I'm seeing two changes:
- We no longer look only at workers with relevant dependencies; we also consider idle workers. I agree that this is a big benefit.
- We don't do a round robin, but instead use the worker objective. I agree that this is often a benefit, but I am concerned about introducing checks that depend on the number of workers for each task. I think that we may want to defend against this. There are a few options:
  - Round robin defends against this but results in suboptimal scheduling. One answer is that we keep doing round robin until we identify that it causes user problems. I wouldn't be surprised if it never did over the lifetime of a computation, because we'll continue correcting things.
  - Use the worker objective if the number of thieves is small (less than twenty) but fall back to round robin otherwise (we do this elsewhere).
  - Select a random sample of the thieves (maybe ten?) and then use the worker objective on that sample (idea from @gjoseph92). Maybe this gives us the best of both worlds? A sketch of this option follows below.
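Something like this, perhaps. The names are illustrative, not the actual stealing code; `thieves` is assumed to be a list of candidate workers and `objective` a callable:

```python
import random

def pick_thief_sampled(thieves, objective, k=10):
    # Evaluate the worker objective on a bounded random sample so the
    # per-task cost stays O(k) instead of O(number of workers).
    if len(thieves) > k:
        thieves = random.sample(thieves, k)
    return min(thieves, key=objective)
```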
Change 2B: relax the sentinel values in `steal_time_ratio`. While it is OK to use a short path for no-dependency tasks to save compute cost, we should not disallow work stealing for generally small or expensive tasks (`duration < 0.005` or `cost_multiplier > 100`). I believe these two cases should not be short-circuited but rather be dealt with using the usual logic. If a steal is too expensive and doesn't pay off, the tasks are not stolen, but there is a remote chance that it would still pay off to move them in extreme situations (as in this example). If we do not remove these sentinels, these tasks are forever blacklisted and will never be stolen.
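To illustrate the idea, a much-simplified sketch of the ratio computation (not distributed's actual `steal_time_ratio`; the parameters are illustrative):

```python
def steal_time_ratio(compute_duration, dep_nbytes, bandwidth=100e6):
    # Simplified sketch: rather than returning a sentinel (which would
    # blacklist the task from stealing forever) for very cheap tasks or
    # very large multipliers, always return the real ratio and let the
    # usual stealing logic decide whether a move pays off.
    transfer_time = sum(dep_nbytes) / bandwidth
    # Guard against near-zero durations instead of short-circuiting.
    return transfer_time / max(compute_duration, 1e-9)
```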
> Select a random sample of the thieves (maybe ten?) and then use the worker objective on that sample (idea from @gjoseph92)
I like this idea so much that I wish it was my own ;) https://github.com/dask/distributed/pull/4920#discussion_r651876940
> we should not disallow work stealing for generally small or expensive tasks (`duration < 0.005` or `cost_multiplier > 100`). I believe these two cases should not be short-circuited but rather be dealt with using the usual logic
That's ok with me
> I like this idea so much that I wish it was my own ;) #4920 (comment)
Whoops! Well then it sounds like it must be a great idea :)
> Whoops! Well then it sounds like it must be a great idea :)
I'm not entirely certain, to be honest. This is one of the cases where I would really love some data. After all, what are the chances that a random sample would keep the one or two workers which carry dependencies? If I cannot back this up with data, I'm inclined to roll this back.
However, overall I'd really love to have some micro benchmarks on this to know how bad this actually is. The comments the two of you left about performance make 100% sense, but it is all a bit fuzzy and sometimes hard to spot.
For instance, I would've thought that certain operations would be A) faster and B) identical, but I was wrong. Twice.
Anyhow, this is off-topic, but I would love to discuss the topic of benchmarking with a few people (maybe at the next dev meeting).
> After all, what are the chances that a random sample would keep the one or two workers which carry dependencies?
Ah, my thinking (or rather Gabe's thinking) was that you would keep the workers with dependencies in the set, but then mix in a few others from outside of the set.
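As a rough sketch (illustrative names, not the actual implementation):

```python
import random

def candidate_thieves(dep_holders, other_workers, k=10):
    # Keep every worker that already holds dependencies in the candidate
    # set, then mix in a few randomly sampled workers from outside it
    # before applying the worker objective.
    extras = random.sample(other_workers, min(k, len(other_workers)))
    return list(dep_holders) + extras
```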
> Ah, my thinking (or rather Gabe's thinking) was that you would keep the workers with dependencies in the set
Right... that sounds straightforward. There is a saying in Germany, "Manchmal sieht man den Wald vor lauter Bäumen nicht", which translates loosely to "Sometimes you lose sight of the forest because there are too many trees." No idea if this makes sense to you, but I think that's a good idea :D (and better than mine, after all)
> Sometimes you lose sight of the forest because there are too many trees
This saying is commonly known in English as well. It's a good saying.
https://www.dictionary.com/browse/can-t-see-the-forest-for-the-trees
(Code still needs to be updated)
I removed all the changes around `decide_worker` and solely focused on the blacklisting of `cost_multiplier > 100` and `duration < 0.005` tasks (which are often the same). Below you can see the stealing tab (the tab includes fixes, PR upcoming), which shows the stealing activity and occupancy over time.
(screenshot: main / default)
(screenshot: without blacklisting "expensive" tasks)
What we can see is that the occupancy (timeseries chart, top) is balanced much more effectively when the small tasks are not blacklisted. I do not have proper perf reports or screenshots, but the task stream density behaves similarly: with the well-balanced occupancy, the task stream is much denser.
I spoke with @fjetter about this after replicating some of this work in https://github.com/dask/distributed/pull/6115.
It sounds like there isn't a major blocker here. I'm going to see if I can push it through today. @fjetter if there are any major concerns you have on this PR that weren't listed above then please let me know.
Unit Test Results
16 files ±0, 16 suites ±0, 8h 3m 25s :stopwatch: (+51m 57s)
2 737 tests +3: 2 620 :heavy_check_mark: (-32), 81 :zzz: (±0), 36 :x: (+35)
21 781 runs +24: 20 596 :heavy_check_mark: (-85), 1 080 :zzz: (+5), 105 :x: (+104)
For more details on these failures, see this check.
Results for commit 31e3b9fe. ± Comparison against base commit 6a3cbd38.
> if there are any major concerns you have on this PR that weren't listed above then please let me know.
Dropping the sentinels will emphasize the importance of proper measurements (https://github.com/dask/distributed/pull/6115#issuecomment-1098961206), which is where I ran out of time last year.
I myself would've only picked this up again after having a few solid benchmarks up and running. Alternatively, a lot of manual testing; maybe both.