[Docs][Kuberay] Documentation for Using Kuberay with TPUs
Why are these changes needed?
Add documentation for users seeking to use Kuberay with TPUs on GKE, similar to the existing documentation for GPUs. This PR depends on example code added in https://github.com/ray-project/serve_config_examples/pull/8 and https://github.com/ray-project/kuberay/pull/2198.
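For context, a TPU worker group in a KubeRay manifest looks roughly like the sketch below. This is illustrative only, not the exact sample manifest from the linked PRs: the group name, image tag, and replica counts here are assumptions. GKE schedules the worker Pods onto the TPU node pool via the `cloud.google.com/gke-tpu-accelerator` and `cloud.google.com/gke-tpu-topology` node selectors and the `google.com/tpu` resource request:

```yaml
# Illustrative RayCluster workerGroupSpecs entry for a single-host v4 TPU
# Pod slice; names and values are examples, not the actual sample manifest.
workerGroupSpecs:
- groupName: tpu-group
  replicas: 1
  numOfHosts: 1          # single-host Pod slice
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x1
      containers:
      - name: ray-worker
        image: rayproject/ray:2.9.0
        resources:
          limits:
            google.com/tpu: "4"   # 4 TPU chips per host
          requests:
            google.com/tpu: "4"
```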
Related issue number
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
/cc
@ryanaoleary can you fix the CI error?
Fixed in be33993.
I manually tested the guide:
```console
$ gcloud container clusters create andrewsy-kuberay-tpu-cluster --addons RayOperator --cluster-version=1.30 --location=us-central2-b --project <my-project>
...
kubeconfig entry generated for andrewsy-kuberay-tpu-cluster.
NAME                          LOCATION       MASTER_VERSION      MASTER_IP     MACHINE_TYPE   NODE_VERSION        NUM_NODES  STATUS
andrewsy-kuberay-tpu-cluster  us-central2-b  1.30.2-gke.1587000  35.186.20.75  n1-standard-1  1.30.2-gke.1587000  3          RUNNING

$ gcloud container node-pools create tpu-pool --zone us-central2-b --cluster andrewsy-kuberay-tpu-cluster --num-nodes 1 --machine-type ct4p-hightpu-4t --tpu-topology 2x2x1 --project tpu-vm-gke-testing
Creating node pool tpu-pool...done.
...
NAME      MACHINE_TYPE     DISK_SIZE_GB  NODE_VERSION
tpu-pool  ct4p-hightpu-4t  100           1.30.2-gke.1587000

$ kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.tpu-single-host.yaml
rayservice.ray.io/stable-diffusion-tpu created

$ kubectl get po
NAME                                                      READY   STATUS    RESTARTS      AGE
e-diffusion-tpu-raycluster-d9g7f-worker-tpu-group-mqc5p   1/1     Running   3 (22m ago)   30m
stable-diffusion-tpu-raycluster-d9g7f-head-vflpt          1/1     Running   0             30m

$ kubectl get svc
NAME                                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
kubernetes                                       ClusterIP   34.118.224.1     <none>        443/TCP                                         43m
stable-diffusion-tpu-head-svc                    ClusterIP   34.118.230.109   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   20m
stable-diffusion-tpu-raycluster-d9g7f-head-svc   ClusterIP   34.118.230.159   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   31m
stable-diffusion-tpu-serve-svc                   ClusterIP   34.118.234.225   <none>        8000/TCP                                        20m

$ python stable_diffusion_tpu_req.py --save_pictures
num_requests: 8
batch_size: 8
url: http://localhost:8000/imagine
save_pictures: True
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
Handling connection for 8000
  0%|                                                    | 0/8 [00:00<?, ?it/s]
100%|████████████████████████████████████████████| 8/8 [02:49<00:00, 21.16s/it]
```
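The request script itself lives in the serve_config_examples repo, not this PR. As a rough sketch of what such a client might look like, assuming the Serve service has been port-forwarded locally (e.g. `kubectl port-forward svc/stable-diffusion-tpu-serve-svc 8000`), and noting that the `/imagine` endpoint path, the `prompt` query parameter, and the `build_request` helper are assumptions for illustration rather than the actual `stable_diffusion_tpu_req.py` implementation:

```python
# Hypothetical sketch of a Stable Diffusion Serve client; the /imagine
# endpoint and "prompt" query parameter are assumptions, not the actual
# stable_diffusion_tpu_req.py implementation.
import urllib.parse
import urllib.request


def build_request(base_url: str, prompt: str) -> str:
    """Build the GET URL for one image-generation request."""
    query = urllib.parse.urlencode({"prompt": prompt})
    return f"{base_url}/imagine?{query}"


def fetch_image(base_url: str, prompt: str) -> bytes:
    """Send one request and return the raw response body (image bytes)."""
    with urllib.request.urlopen(build_request(base_url, prompt)) as resp:
        return resp.read()


if __name__ == "__main__":
    # Assumes: kubectl port-forward svc/stable-diffusion-tpu-serve-svc 8000
    data = fetch_image("http://localhost:8000", "a cat reading a book")
    with open("image_0.png", "wb") as f:
        f.write(data)
```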
Thanks! I will request our doc team to review this PR.
2. PodSlice
I switched all usages of "pod" to "Pod", but I think we should keep the naming "Pod slice" rather than "PodSlice", since that's what the GKE documentation uses (see https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#slices).
@ryanaoleary would you mind fixing the CI error? Thanks!
Can you paste the CI error? I'm not able to see it.
@can-anyscale @angelinalg Do you have any idea why the build was canceled?
Oh, please rebase to the latest; builds on readthedocs were timing out recently, but that should already be fixed in the master branch.
Just waiting for a rebuild to finish reviewing.
I'm just waiting for a successful docs build to verify the changes. Thanks!
I just pulled in the latest changes; it looks like the build for e2fe072 passed CI @angelinalg
