[Core] Enable Scaling Down for Multi-Host TPU Replicas
Why are these changes needed?
Adds support for the Ray autoscaler and the KubeRay NodeProvider to scale down TPU podslices. TPU podslices are atomic, so all Ray nodes belonging to a TPU podslice must be scaled down together. This PR associates nodes with the replica (representing a podslice) of the TPU worker group they belong to via a replicaIndex Pod label, which is set by a GKE webhook. When a TPU node is deleted, the other nodes in its replica (tracked through a mapping) are scheduled for deletion as well.
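At a high level, the autoscaler groups worker nodes by their replicaIndex label and terminates every member of a replica together. The sketch below only illustrates that grouping idea; the helper name, the node/label data structures, and the replicaIndex values are hypothetical, not the actual KubeRay NodeProvider API.

```python
from collections import defaultdict

# Hypothetical sketch of replica-aware scale-down; the real logic lives in the
# autoscaler's BatchingNodeProvider and reads Pod labels set by the GKE webhook.
def nodes_to_terminate(idle_nodes, node_labels):
    """Return the full set of nodes to delete so that multi-host TPU
    replicas (podslices) are always removed atomically."""
    # Map each replicaIndex value to all nodes belonging to that replica.
    replica_to_nodes = defaultdict(set)
    for node, labels in node_labels.items():
        if "replicaIndex" in labels:
            replica_to_nodes[labels["replicaIndex"]].add(node)

    to_delete = set()
    for node in idle_nodes:
        labels = node_labels.get(node, {})
        if "replicaIndex" in labels:
            # Deleting one TPU worker implies deleting its whole replica.
            to_delete |= replica_to_nodes[labels["replicaIndex"]]
        else:
            to_delete.add(node)
    return to_delete

# Example: two workers share replica "tpu-group-0"; one idles, both are deleted.
labels = {
    "worker-a": {"replicaIndex": "tpu-group-0"},
    "worker-b": {"replicaIndex": "tpu-group-0"},
}
assert nodes_to_terminate({"worker-a"}, labels) == {"worker-a", "worker-b"}
```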
Related PR: https://github.com/ray-project/ray/pull/45105
Related issue number
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [x] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [x] Manual tests
- [ ] Release tests
- [ ] This PR is not tested :(
This is on a critical code path. We should have more testing. Let's discuss it in today's sync.
This PR was manually tested as follows:
Prerequisites:
- GKE cluster with TPU quota and Node Autoprovisioning enabled, or a v4 2x2x2 TPU nodepool already created.
- Ray TPU initialization webhook installed in-cluster.
- KubeRay operator v1.1.1 installed in-cluster.
Testing:
- Build Ray from source and replace the autoscaler image in the RayCluster below with one containing these changes.
- Apply the autoscaler template with detached actor scripts, edited respectively to include a TPU worker group and to request resources={"TPU": 4}.
- Detached Actor:
```python
import ray
import sys

@ray.remote(num_cpus=1, resources={"TPU": 4})
class Actor:
    pass

ray.init(namespace="default_namespace")
Actor.options(name=sys.argv[1], lifetime="detached").remote()
```
- TPU worker group:
```yaml
- replicas: 0
  minReplicas: 0
  maxReplicas: 2
  numOfHosts: 2
  groupName: tpu-group
  rayStartParams:
    resources: '"{\"TPU\": 4}"'
  ...
  requests:
    cpu: "1"
    ephemeral-storage: 10Gi
    google.com/tpu: "4"
    memory: 40G
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 2x2x2
```
- Scale up two TPU workers using detached actors with a resource request of "TPU: 4" each. The autoscaler will scale up 1 replica of the tpu-group worker group to meet this request, which will create 2 workers since numOfHosts: 2:
```sh
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor1
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor2
```
- Run kubectl describe on the worker Pods to verify they are created with the GKE-set replicaIndex label for multi-host workers (a rough Python check is sketched below).
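As an alternative check from Python (assuming the kubernetes client package is available and that the worker Pods carry KubeRay's ray.io/group label plus the replicaIndex label from the webhook), the following prints the replica assignment of each worker in the group; kubectl describe works just as well.

```python
from kubernetes import client, config

# Assumes a kubeconfig pointing at the GKE cluster; the label keys below
# (ray.io/group, replicaIndex) are assumptions based on this PR's description.
config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod("default", label_selector="ray.io/group=tpu-group")
for pod in pods.items:
    print(pod.metadata.name, pod.metadata.labels.get("replicaIndex"))
```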
- Delete one of the detached actors, causing the node to become idle:
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/terminate_detached_actor.py actor1
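The terminate_detached_actor.py sample is not shown above; a minimal sketch of what such a script presumably does (look up the detached actor by name and kill it) is:

```python
import ray
import sys

# Sketch of the terminate script: kill the named detached actor so its
# TPU resources are released and the node can become idle.
ray.init(namespace="default_namespace")
actor = ray.get_actor(sys.argv[1])
ray.kill(actor)
```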
- Once the node is marked as idle, the autoscaler should terminate the node. The BatchingNodeProvider will detect the replicaIndex label on each node and scale down the other worker in the replica at the same time.
- Both workers are deleted (there is still one detached actor alive requesting TPUs, so a new multi-host group is then scaled back up).
- Autoscaler logs:
Could you share more details about the detached actor and add more details about why you expect the cluster to look like this at each step?
Sure, I edited the comment to include more detail.
@ryanaoleary could you also rebase your branch to fix the CI error? Thanks!
@can-anyscale could you retry the failed test? It is unrelated to this PR. Thanks!
The RLlib tests fail after retry, but I don't think that's related to this PR because this PR only touches KubeRay. cc @jjyao @can-anyscale