Improve work stealing for scaling situations
These are a few preliminary fixes which would close #4471. While this does not necessarily account for opportunity cost, as I described in https://github.com/dask/distributed/issues/4471#issuecomment-861483765, it still remedies the situation by:
- Allowing idle workers to be part of the decision when scheduling a task. The worker objective will take care of the rest and should assign it to the proper worker if it is not overloaded. That might backfire and should be tested in another context.
- The work stealing ratio calculation emits sentinel values in situations which are arguably not situations we should skip. As I argued in https://github.com/dask/distributed/issues/4471#issuecomment-861483765, there are situations where we should consider stealing even if the task is incredibly cheap to compute, since it might allow for further parallelism, in particular if the cluster is partially idling.
- This is minor tuning and not connected to the problem, but I chose to pick the thief not based on round robin but on the worker objective, as we do for the initial decision (a minimal sketch follows below). We might need to tweak this since it doesn't account for in-flight occupancy and might cause another overload of a worker.
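A minimal, self-contained sketch of that last point. The names here (`Worker`, `objective`, `pick_thief`) are illustrative stand-ins, not the actual `distributed` scheduler code:

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    # Illustrative stand-in for distributed's WorkerState
    occupancy: float = 0.0                      # seconds of queued work
    has_data: set = field(default_factory=set)  # dependency keys held here

def objective(ws, task_deps, dep_nbytes, bandwidth=100e6):
    # Estimated cost of running the task on ws: its current backlog plus
    # the time to transfer whichever dependencies it is missing.
    missing = task_deps - ws.has_data
    transfer_time = sum(dep_nbytes[d] for d in missing) / bandwidth
    return ws.occupancy + transfer_time

def pick_thief(idle_workers, task_deps, dep_nbytes):
    # Round robin would cycle through idle workers regardless of fit;
    # instead, pick the worker that minimizes the objective for this task.
    return min(idle_workers, key=lambda ws: objective(ws, task_deps, dep_nbytes))
```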
Tests missing (TODO), feedback welcome
- [x] Closes #4471
- [ ] Tests added / passed
- [x] Passes `black distributed` / `flake8 distributed` / `isort distributed`
I completely forgot: performance reports.
Before: https://gistcdn.githack.com/fjetter/6e04f5ced602378977c8fea0037e2c31/raw/b8dbbd5ae962ede8c0853926649619714ae1ded3/gh4471.html
After: https://gistcdn.githack.com/fjetter/6e04f5ced602378977c8fea0037e2c31/raw/b8dbbd5ae962ede8c0853926649619714ae1ded3/gh4471_combined.html
FYI: tests break because the stealing now behaves differently and the tests reflect that (good!). Just wanted to get this out early in case somebody has input already.
Summarizing things a bit here, I'm seeing two changes:
- We no longer look only at workers with relevant dependencies; we also consider idle workers. I agree that this is a big benefit.
- We don't do a round robin, but instead use the worker objective. I agree that this is often a benefit, but I am concerned about introducing checks that depend on the number of workers for each task. I think that we may want to defend against this. There are a few options:
  - Round robin defends against this but results in suboptimal scheduling. One answer is that we keep doing round robin until we identify that it causes user problems. I wouldn't be surprised if it never did over the lifetime of a computation, because we'll continue correcting things.
  - Use the worker objective if the number of thieves is small (less than twenty) but fall back to round robin otherwise (we do this elsewhere).
  - Select a random sample of the thieves (maybe ten?) and then use the worker objective on that sample (idea from @gjoseph92). Maybe this gives us the best of both worlds? A sketch of this option follows below.
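Something like this, perhaps. The names are illustrative, not the actual stealing code; `thieves` is assumed to be a list of candidate workers and `objective` a callable:

```python
import random

def pick_thief_sampled(thieves, objective, k=10):
    # Evaluate the worker objective on a bounded random sample so the
    # per-task cost stays O(k) instead of O(number of workers).
    if len(thieves) > k:
        thieves = random.sample(thieves, k)
    return min(thieves, key=objective)
```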
Change 2B: relax the sentinel values in `steal_time_ratio`. While it is OK to use a short path for no-dependency tasks to save compute cost, we should not disallow work stealing for generally small or expensive tasks (`duration < 0.005` or `cost_multiplier > 100`). I believe these two cases should not be short-circuited but rather be dealt with using the usual logic. If a steal is too expensive and doesn't pay off, the tasks are not stolen, but there is a remote chance that it would still pay off to move them in extreme situations (as in this example). If we do not remove these sentinels, these tasks are forever blacklisted and will never be stolen.
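To illustrate the idea, a much-simplified sketch of the ratio computation (not distributed's actual `steal_time_ratio`; the parameters are illustrative):

```python
def steal_time_ratio(compute_duration, dep_nbytes, bandwidth=100e6):
    # Simplified sketch: rather than returning a sentinel (which would
    # blacklist the task from stealing forever) for very cheap tasks or
    # very large multipliers, always return the real ratio and let the
    # usual stealing logic decide whether a move pays off.
    transfer_time = sum(dep_nbytes) / bandwidth
    # Guard against near-zero durations instead of short-circuiting.
    return transfer_time / max(compute_duration, 1e-9)
```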
> Select a random sample of the thieves (maybe ten?) and then use the worker objective on that sample (idea from @gjoseph92)
I like this idea so much that I wish it was my own ;) https://github.com/dask/distributed/pull/4920#discussion_r651876940
> we should not disallow work stealing for generally small or expensive tasks (`duration < 0.005` or `cost_multiplier > 100`). I believe these two cases should not be short-circuited but rather be dealt with using the usual logic
That's ok with me
> I like this idea so much that I wish it was my own ;) #4920 (comment)
Whoops! Well then it sounds like it must be a great idea :)
> Whoops! Well then it sounds like it must be a great idea :)
I'm not entirely certain, to be honest. This is one of the cases where I would really love some data. After all, what are the chances that a random sample would keep the one or two workers which carry dependencies? If I cannot back this up with data, I'm inclined to roll this back.
However, overall I'd really love to have some micro benchmarks on this to know how bad this actually is. The comments the two of you left about performance make 100% sense, but it is all a bit fuzzy and sometimes hard to spot.
For instance, I would've thought that certain operations would be A) faster and B) identical, but I was wrong. Twice.
Anyhow, this is off-topic, but I would love to discuss the topic of benchmarking with a few people (maybe at the next dev meeting).
> After all, what are the chances that a random sample would keep the one or two workers which carry dependencies?
Ah, my thinking (or rather Gabe's thinking) was that you would keep the workers with dependencies in the set, but then mix in a few others from outside of the set.
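As a rough sketch (illustrative names, not the actual implementation):

```python
import random

def candidate_thieves(dep_holders, other_workers, k=10):
    # Keep every worker that already holds dependencies in the candidate
    # set, then mix in a few randomly sampled workers from outside it
    # before applying the worker objective.
    extras = random.sample(other_workers, min(k, len(other_workers)))
    return list(dep_holders) + extras
```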
> Ah, my thinking (or rather Gabe's thinking) was that you would keep the workers with dependencies in the set
Right... that sounds straightforward. There is a saying in Germany, "Manchmal sieht man den Wald vor lauter Bäumen nicht", which translates loosely to "Sometimes you lose sight of the forest because there are too many trees." No idea if this makes sense to you, but I think that's a good idea :D (and better than mine, after all)
> Sometimes you lose sight of the forest because there are too many trees
This saying is commonly known in English as well. It's a good saying.
https://www.dictionary.com/browse/can-t-see-the-forest-for-the-trees
(Code still needs to be updated)
I removed all the changes around `decide_worker` and solely focused on the blacklisting of `cost_multiplier > 100` and `duration < 0.005` tasks (which are often the same). Below you can see the stealing tab (the tab includes fixes, PR upcoming), which shows the stealing activity and occupancy over time.
(screenshot: main / default)
(screenshot: without blacklisting "expensive" tasks)
What we can see is that the occupancy (timeseries chart, top) is balanced much more effectively when the small tasks are not blacklisted. I do not have proper perf reports or screenshots, but the task stream density behaves similarly: with the well-balanced occupancy, the task stream is much denser.
I spoke with @fjetter about this after replicating some of this work in https://github.com/dask/distributed/pull/6115.
It sounds like there isn't a major blocker here. I'm going to see if I can push it through today. @fjetter if there are any major concerns you have on this PR that weren't listed above then please let me know.
Unit Test Results
16 files ±0, 16 suites ±0, 8h 3m 25s :stopwatch: (+51m 57s)
2 737 tests +3: 2 620 :heavy_check_mark: (-32), 81 :zzz: (±0), 36 :x: (+35)
21 781 runs +24: 20 596 :heavy_check_mark: (-85), 1 080 :zzz: (+5), 105 :x: (+104)
For more details on these failures, see this check.
Results for commit 31e3b9fe. ± Comparison against base commit 6a3cbd38.
> if there are any major concerns you have on this PR that weren't listed above then please let me know.
Dropping the sentinels will emphasize the importance of proper measurements (https://github.com/dask/distributed/pull/6115#issuecomment-1098961206), which is where I ran out of time last year.
I myself would've only picked this up again after having a few solid benchmarks up and running. Alternatively, a lot of manual testing; maybe both.