[Core] Enable Scaling Down for Multi-Host TPU Replicas

Open ryanaoleary opened this issue 1 year ago • 2 comments

Why are these changes needed?

Adds support for the Ray autoscaler and the KubeRay NodeProvider to scale down TPU podslices. TPU podslices are atomic, so all Ray nodes belonging to a podslice must be scaled down together. This PR associates each node with the replica (representing a podslice) of the TPU worker group it belongs to, using a replicaIndex Pod label that is set by a GKE webhook. When a TPU node is deleted, the other nodes in that replica (tracked through a mapping) are scheduled for deletion as well.
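The replica-level deletion described above can be illustrated with a minimal standalone sketch. It is not the code in this PR: the function names `build_replica_map` and `expand_deletions`, the plain-dict representation of Pod labels, and the example label values are all hypothetical. Only the `replicaIndex` label name comes from the description.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Label name taken from the PR description; everything else here is illustrative.
REPLICA_INDEX_LABEL = "replicaIndex"


def build_replica_map(pod_labels: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Group node IDs by the replica (podslice) named in their Pod labels."""
    replica_to_nodes: Dict[str, Set[str]] = defaultdict(set)
    for node_id, labels in pod_labels.items():
        replica = labels.get(REPLICA_INDEX_LABEL)
        if replica is not None:
            replica_to_nodes[replica].add(node_id)
    return dict(replica_to_nodes)


def expand_deletions(
    to_delete: Iterable[str], pod_labels: Dict[str, Dict[str, str]]
) -> Set[str]:
    """Return the requested nodes plus every other node in the same replica."""
    replica_map = build_replica_map(pod_labels)
    requested = set(to_delete)
    expanded = set(requested)
    for node_id in requested:
        replica = pod_labels.get(node_id, {}).get(REPLICA_INDEX_LABEL)
        if replica is not None:
            # A podslice is atomic: deleting one node implies deleting its siblings.
            expanded |= replica_map[replica]
    return expanded


# Hypothetical cluster: two nodes in one replica, one in another, one unlabeled.
pods = {
    "node-a": {REPLICA_INDEX_LABEL: "tpu-group-0"},
    "node-b": {REPLICA_INDEX_LABEL: "tpu-group-0"},
    "node-c": {REPLICA_INDEX_LABEL: "tpu-group-1"},
    "node-d": {},
}
nodes_to_remove = expand_deletions(["node-a"], pods)
```

In this sketch, nodes without the label (such as non-TPU workers) are deleted individually, while a labeled node pulls in every sibling that shares its replica.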

Related PR: https://github.com/ray-project/ray/pull/45105

Related issue number

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [x] Manual tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

ryanaoleary • Feb 27 '24 21:02