Handle starting worker throttling inside worker pool
Signed-off-by: Jiajun Yao [email protected]
Why are these changes needed?
Currently, the worker pool throttles how many workers can be started simultaneously (i.e. maximum_startup_concurrency_). Right now, if a PopWorker call cannot be fulfilled due to throttling, it fails and the caller (i.e. the local task manager) handles the retry. The issue is that when PopWorker fails, the local task manager releases the resources claimed by the task. As a result, even though the node already has enough tasks to use up all of its resources, it still reports available resources and attracts more tasks than it can handle. Instead of having the local task manager handle the throttling, the worker pool should handle it internally, since throttling is a transient condition rather than a real error; it is effectively the same as a longer worker startup time.
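
To make the intended behavior concrete, below is a minimal, self-contained C++ sketch of the queueing idea. It is not Ray's actual WorkerPool implementation; apart from PopWorker and maximum_startup_concurrency_, the class, method, and type names are hypothetical stand-ins. When the startup-concurrency limit is reached, the request is held in an internal queue rather than failed back to the local task manager, and it is dispatched once a starting worker registers and frees a startup slot.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <iostream>
#include <utility>

// Simplified stand-ins for illustration; not Ray's real types.
struct Worker {
  int id;
};
using PopWorkerCallback = std::function<void(const Worker &)>;

class WorkerPoolSketch {
 public:
  explicit WorkerPoolSketch(size_t maximum_startup_concurrency)
      : maximum_startup_concurrency_(maximum_startup_concurrency) {}

  // Instead of failing when the startup-concurrency limit is hit, keep the
  // request pending inside the pool. The caller never sees an error, so the
  // local task manager never releases the task's resources prematurely.
  void PopWorker(PopWorkerCallback callback) {
    if (starting_workers_.size() >= maximum_startup_concurrency_) {
      pending_pop_requests_.push_back(std::move(callback));
    } else {
      StartWorkerProcess(std::move(callback));
    }
  }

  // Simulates a started worker process registering with the pool.
  void OnWorkerRegistered() {
    if (starting_workers_.empty()) return;
    PopWorkerCallback cb = std::move(starting_workers_.front());
    starting_workers_.pop_front();
    cb(Worker{next_worker_id_++});  // Hand the new worker to its requester.
    // A startup slot is free again: dispatch the next queued request, if any.
    if (!pending_pop_requests_.empty()) {
      StartWorkerProcess(std::move(pending_pop_requests_.front()));
      pending_pop_requests_.pop_front();
    }
  }

  size_t NumStartingWorkers() const { return starting_workers_.size(); }
  size_t NumPendingPopRequests() const { return pending_pop_requests_.size(); }

 private:
  void StartWorkerProcess(PopWorkerCallback callback) {
    // The real raylet forks a worker process asynchronously; here we just
    // remember the callback until OnWorkerRegistered() is called.
    starting_workers_.push_back(std::move(callback));
  }

  const size_t maximum_startup_concurrency_;
  int next_worker_id_ = 0;
  std::deque<PopWorkerCallback> starting_workers_;
  std::deque<PopWorkerCallback> pending_pop_requests_;
};

int main() {
  WorkerPoolSketch pool(/*maximum_startup_concurrency=*/2);
  for (int i = 0; i < 4; ++i) {
    pool.PopWorker([i](const Worker &w) {
      std::cout << "request " << i << " got worker " << w.id << "\n";
    });
  }
  // Only 2 workers are starting; the other 2 requests wait inside the pool
  // instead of failing back to the caller.
  std::cout << "starting=" << pool.NumStartingWorkers()
            << " pending=" << pool.NumPendingPopRequests() << "\n";
  for (int i = 0; i < 4; ++i) {
    pool.OnWorkerRegistered();  // Each registration frees a startup slot.
  }
  return 0;
}
```

Because a throttled request stays pending inside the pool, the local task manager keeps the task's resources claimed, so the node's reported availability remains accurate while workers are starting.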
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
It may be too difficult to write one that isn't flaky, but you could consider also adding a Python test that checks the resource availability accounting is correct while workers are starting.
Release tests look good: https://buildkite.com/ray-project/release-tests-pr/builds/16008#_. Didn't see improvement or regression.