[Core] Enable Scaling Down for Multi-Host TPU Replicas
Why are these changes needed?
Adds support for the Ray autoscaler and the KubeRay NodeProvider to scale down TPU podslices. TPU podslices are atomic, so all Ray nodes belonging to a TPU podslice must be scaled down together. This PR associates nodes with the replica (representing a podslice) of the TPU worker group they belong to via a replicaIndex Pod label, which is set by a GKE webhook. When a TPU node is deleted, the other nodes in its replica (tracked through a mapping) are scheduled for deletion as well.
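At a high level, the autoscaler groups worker nodes by their replicaIndex label and terminates every member of a replica together. The sketch below only illustrates that grouping idea; the helper name, the node/label data structures, and the replicaIndex values are hypothetical, not the actual KubeRay NodeProvider API.

```python
from collections import defaultdict

# Hypothetical sketch of replica-aware scale-down; the real logic lives in the
# autoscaler's BatchingNodeProvider and reads Pod labels set by the GKE webhook.
def nodes_to_terminate(idle_nodes, node_labels):
    """Return the full set of nodes to delete so that multi-host TPU
    replicas (podslices) are always removed atomically."""
    # Map each replicaIndex value to all nodes belonging to that replica.
    replica_to_nodes = defaultdict(set)
    for node, labels in node_labels.items():
        if "replicaIndex" in labels:
            replica_to_nodes[labels["replicaIndex"]].add(node)

    to_delete = set()
    for node in idle_nodes:
        labels = node_labels.get(node, {})
        if "replicaIndex" in labels:
            # Deleting one TPU worker implies deleting its whole replica.
            to_delete |= replica_to_nodes[labels["replicaIndex"]]
        else:
            to_delete.add(node)
    return to_delete

# Example: two workers share replica "tpu-group-0"; one idles, both are deleted.
labels = {
    "worker-a": {"replicaIndex": "tpu-group-0"},
    "worker-b": {"replicaIndex": "tpu-group-0"},
}
assert nodes_to_terminate({"worker-a"}, labels) == {"worker-a", "worker-b"}
```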
Related PR: https://github.com/ray-project/ray/pull/45105
Related issue number
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [x] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [x] Manual tests
- [ ] Release tests
- [ ] This PR is not tested :(
This is on a critical code path. We should have more testing. Let's discuss it in today's sync.
This PR was manually tested as follows:
Prerequisites:
- GKE cluster with TPU quota and Node Autoprovisioning enabled, or a v4 2x2x2 TPU nodepool already created.
- Ray TPU initialization webhook installed in-cluster.
- KubeRay operator v1.1.1 installed in-cluster.
Testing:
- Build Ray from source and replace the autoscaler image in the RayCluster below with one containing these changes.
- Apply the autoscaler template with detached actor scripts, edited respectively to include a TPU worker group and to request resources={"TPU": 4}.
- Detached Actor:
```python
import ray
import sys

@ray.remote(num_cpus=1, resources={"TPU": 4})
class Actor:
    pass

ray.init(namespace="default_namespace")
Actor.options(name=sys.argv[1], lifetime="detached").remote()
```
- TPU worker group:
```yaml
- replicas: 0
  minReplicas: 0
  maxReplicas: 2
  numOfHosts: 2
  groupName: tpu-group
  rayStartParams:
    resources: '"{\"TPU\": 4}"'
  ...
  requests:
    cpu: "1"
    ephemeral-storage: 10Gi
    google.com/tpu: "4"
    memory: 40G
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 2x2x2
```
- Scale up two TPU workers using detached actors with a resource request of "TPU: 4" each. The autoscaler will scale up 1 replica of the tpu-group worker group to meet this request, which will create 2 workers since numOfHosts: 2:
```sh
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor1
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py actor2
```
- Run kubectl describe on the worker Pods to verify they are created with the GKE-set replicaIndex label for multi-host workers (a rough Python check is sketched below).
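As an alternative check from Python (assuming the kubernetes client package is available and that the worker Pods carry KubeRay's ray.io/group label plus the replicaIndex label from the webhook), the following prints the replica assignment of each worker in the group; kubectl describe works just as well.

```python
from kubernetes import client, config

# Assumes a kubeconfig pointing at the GKE cluster; the label keys below
# (ray.io/group, replicaIndex) are assumptions based on this PR's description.
config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod("default", label_selector="ray.io/group=tpu-group")
for pod in pods.items:
    print(pod.metadata.name, pod.metadata.labels.get("replicaIndex"))
```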
- Delete one of the detached actors, causing the node to become idle:
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/terminate_detached_actor.py actor1
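The terminate_detached_actor.py sample is not shown above; a minimal sketch of what such a script presumably does (look up the detached actor by name and kill it) is:

```python
import ray
import sys

# Sketch of the terminate script: kill the named detached actor so its
# TPU resources are released and the node can become idle.
ray.init(namespace="default_namespace")
actor = ray.get_actor(sys.argv[1])
ray.kill(actor)
```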
- Once the node is marked as idle, the autoscaler should terminate the node. The BatchingNodeProvider will detect the replicaIndex label on each node and scale down the other worker in the replica at the same time.
- Both workers are deleted (there is still one detached actor alive requesting TPUs, so a new multi-host group is then scaled back up).
- Autoscaler logs:
Could you share more details about the detached actor and add more details about why you expect the cluster to look like this at each step?
Sure, I edited the comment to include more detail.
@ryanaoleary could you also rebase your branch to fix the CI error? Thanks!
@can-anyscale could you retry the failed test? It is unrelated to this PR. Thanks!
The RLlib tests fail after retry, but I don't think that's related to this PR because this PR only touches KubeRay. cc @jjyao @can-anyscale