
[Docs][Kuberay] Documentation for Using Kuberay with TPUs

Open ryanaoleary opened this issue 1 year ago • 1 comment

Why are these changes needed?

Add documentation for users seeking to use Kuberay with TPUs on GKE, similar to the existing documentation for GPUs. This PR depends on example code added in https://github.com/ray-project/serve_config_examples/pull/8 and https://github.com/ray-project/kuberay/pull/2198.

Related issue number

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

ryanaoleary avatar Jun 13 '24 06:06 ryanaoleary

/cc

andrewsykim avatar Jun 13 '24 17:06 andrewsykim


@ryanaoleary can you fix the CI error?

kevin85421 avatar Jul 25 '24 05:07 kevin85421

> @ryanaoleary can you fix the CI error?

Fixed in be33993.


ryanaoleary avatar Jul 25 '24 10:07 ryanaoleary

I manually tested the guide:

$ gcloud container clusters create andrewsy-kuberay-tpu-cluster     --addons RayOperator     --cluster-version=1.30     --location=us-central2-b --project <my-project>
...
...
kubeconfig entry generated for andrewsy-kuberay-tpu-cluster.
NAME                          LOCATION       MASTER_VERSION      MASTER_IP     MACHINE_TYPE   NODE_VERSION        NUM_NODES  STATUS
andrewsy-kuberay-tpu-cluster  us-central2-b  1.30.2-gke.1587000  35.186.20.75  n1-standard-1  1.30.2-gke.1587000  3          RUNNING
$ gcloud container node-pools create tpu-pool   --zone us-central2-b   --cluster andrewsy-kuberay-tpu-cluster   --num-nodes 1     --machine-type ct4p-hightpu-4t   --tpu-topology 2x2x1 --project tpu-vm-gke-testing
Creating node pool tpu-pool...done.
...
...
NAME      MACHINE_TYPE     DISK_SIZE_GB  NODE_VERSION
tpu-pool  ct4p-hightpu-4t  100           1.30.2-gke.1587000
$ kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.tpu-single-host.yaml
rayservice.ray.io/stable-diffusion-tpu created
$ kubectl get po
NAME                                                      READY   STATUS    RESTARTS      AGE
e-diffusion-tpu-raycluster-d9g7f-worker-tpu-group-mqc5p   1/1     Running   3 (22m ago)   30m
stable-diffusion-tpu-raycluster-d9g7f-head-vflpt          1/1     Running   0             30m
$ kubectl get svc
NAME                                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
kubernetes                                       ClusterIP   34.118.224.1     <none>        443/TCP                                         43m
stable-diffusion-tpu-head-svc                    ClusterIP   34.118.230.109   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   20m
stable-diffusion-tpu-raycluster-d9g7f-head-svc   ClusterIP   34.118.230.159   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   31m
stable-diffusion-tpu-serve-svc                   ClusterIP   34.118.234.225   <none>        8000/TCP                                        20m
$ python stable_diffusion_tpu_req.py  --save_pictures
num_requests:  8
batch_size:  8
url:  http://localhost:8000/imagine
save_pictures:  True
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
  0%|                                                                                                                                                | 0/8 [00:00<?, ?it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [02:49<00:00, 21.16s/it]
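The "Handling connection for 8000" lines in the transcript suggest a `kubectl port-forward` to the `stable-diffusion-tpu-serve-svc` service was running in another terminal while the client script sent its requests to `localhost:8000/imagine`. As a rough sketch of what a client like `stable_diffusion_tpu_req.py` might do (the function name, the query-parameter API, and the payload shape here are assumptions for illustration, not the actual script), sending a batch of concurrent POSTs to the serve endpoint could look like:

```python
import concurrent.futures

import requests

# Assumes a port-forward is active in another terminal, e.g.:
#   kubectl port-forward svc/stable-diffusion-tpu-serve-svc 8000:8000

def send_requests(url: str, num_requests: int, prompt: str = "a cat") -> list:
    """POST `num_requests` concurrent requests to the serve endpoint and
    return the raw response bodies (e.g. generated image bytes)."""

    def one_request(_: int) -> bytes:
        # Passing the prompt as a query parameter is an assumption about the
        # endpoint's API; the real script may use a JSON body instead.
        resp = requests.post(url, params={"prompt": prompt}, timeout=300)
        resp.raise_for_status()
        return resp.content

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_requests) as pool:
        return list(pool.map(one_request, range(num_requests)))
```

With the port-forward active, `send_requests("http://localhost:8000/imagine", 8)` would roughly match the `num_requests: 8` run shown in the transcript, with each returned body saved to disk when `--save_pictures` is set.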

andrewsykim avatar Jul 25 '24 17:07 andrewsykim

Thanks! I will request our doc team to review this PR.

kevin85421 avatar Jul 25 '24 17:07 kevin85421

> 2. PodSlice

I switched all usages of pod -> Pod, but I think we should keep the name "Pod slice" rather than "PodSlice", since that's what the GKE documentation uses (see https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#slices).

ryanaoleary avatar Jul 30 '24 20:07 ryanaoleary

@ryanaoleary would you mind fixing the CI error? Thanks!

kevin85421 avatar Aug 01 '24 22:08 kevin85421

@ryanaoleary would you mind fixing the CI error? Thanks!

Can you paste the CI error? I'm not able to see it.

ryanaoleary avatar Aug 01 '24 23:08 ryanaoleary

@can-anyscale @angelinalg Do you have any idea why the build was canceled? [screenshot of the canceled build]

kevin85421 avatar Aug 02 '24 21:08 kevin85421

Oh, please rebase to the latest; builds in readthedocs were timing out recently, but that should already be fixed in the master branch.

can-anyscale avatar Aug 07 '24 18:08 can-anyscale

Just waiting for a rebuild to finish reviewing.

angelinalg avatar Aug 07 '24 18:08 angelinalg

I'm just waiting for a successful docs build to verify the changes. Thanks!

angelinalg avatar Aug 07 '24 18:08 angelinalg

I just pulled in the latest changes; it looks like the build for e2fe072 was passing CI. @angelinalg

ryanaoleary avatar Aug 14 '24 20:08 ryanaoleary