kuberay
[Feature] Kuberay Operator TPU Worker Group Support
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
To support multi-host TPUs with KubeRay, the `TPU_WORKER_ID` and `TPU_WORKER_HOSTNAMES` environment variables must be initialized within each TPU worker. A mutating admission webhook [1] can be used to inject these variables. The `TPU_WORKER_HOSTNAMES` variable is a joined list of individually addressable DNS names, one per TPU worker, and relies on a backing headless service to provide these hostnames. Each TPU worker group needs its own headless service so that the subdomains can be created. It would be useful for the KubeRay operator to manage creating and destroying headless services for worker groups that request "google.com/tpu" resources. Any input on how this could best be accomplished would be appreciated. The initial idea is to add a `BuildHeadlessServiceForTPUs()` function in `ray-operator/controllers/ray/common/service.go` that would return a headless service for each worker group requesting TPU resources.
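To make the hostname scheme concrete, here is a minimal sketch of how the joined `TPU_WORKER_HOSTNAMES` value could be derived for one worker group. It is only an illustration under assumptions: the helper names `headlessServiceName` and `tpuWorkerHostnames`, the `<cluster>-<group>-headless` service naming, the `<group>-<index>` pod hostnames, and the comma separator are all hypothetical and not taken from KubeRay or the webhook.

```go
package main

import (
	"fmt"
	"strings"
)

// headlessServiceName returns a per-worker-group service name.
// The "<cluster>-<group>-headless" pattern is an assumption for illustration.
func headlessServiceName(clusterName, groupName string) string {
	return fmt.Sprintf("%s-%s-headless", clusterName, groupName)
}

// tpuWorkerHostnames joins one DNS name per TPU host in the worker group.
// Each name has the form "<hostname>.<subdomain>", where the subdomain is
// the headless service name. The hostname pattern and the comma separator
// are assumptions, not confirmed by the issue.
func tpuWorkerHostnames(groupName, serviceName string, numHosts int) string {
	var hosts []string
	for i := 0; i < numHosts; i++ {
		hosts = append(hosts, fmt.Sprintf("%s-%d.%s", groupName, i, serviceName))
	}
	return strings.Join(hosts, ",")
}

func main() {
	svc := headlessServiceName("raycluster", "tpu-group")
	fmt.Println("TPU_WORKER_HOSTNAMES=" + tpuWorkerHostnames("tpu-group", svc, 2))
}
```

With this scheme each worker group gets its own subdomain, which is why a separate headless service per group is needed. A webhook could then set `TPU_WORKER_ID` to the same index used in the hostname.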
Additionally, to support TPU worker groups, pods within the same worker group should be scheduled with affinity on the same pod slice. This can be accomplished by deriving a topology key from the GKE node pool name and scheduling all pods within a worker group on the corresponding TPU node pool.
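As a rough sketch of the scheduling constraint, the snippet below pins a worker group to one node pool through a node selector on the `cloud.google.com/gke-nodepool` node label, which GKE sets on its nodes. This swaps in a plain node selector as a simpler stand-in for the affinity and topology-key approach described above; the helper name `nodeSelectorForPool` and the pool name are hypothetical.

```go
package main

import "fmt"

// gkeNodePoolLabel is the node label GKE uses to record the node pool name.
const gkeNodePoolLabel = "cloud.google.com/gke-nodepool"

// nodeSelectorForPool copies an existing node selector and adds the node
// pool constraint, so every pod in the worker group lands on the same pool.
// Hypothetical helper for illustration; not part of KubeRay.
func nodeSelectorForPool(existing map[string]string, nodePool string) map[string]string {
	out := make(map[string]string, len(existing)+1)
	for k, v := range existing {
		out[k] = v
	}
	out[gkeNodePoolLabel] = nodePool
	return out
}

func main() {
	// "tpu-pool-a" is a made-up node pool name.
	selector := nodeSelectorForPool(map[string]string{"example": "value"}, "tpu-pool-a")
	fmt.Println(selector[gkeNodePoolLabel])
}
```

The input map is copied rather than mutated, so a selector shared by several worker groups in the same cluster spec would not be changed by accident.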
[1] https://github.com/GoogleCloudPlatform/ai-on-gke/issues/114
Use case
No response
Related issues
https://github.com/ray-project/ray/issues/39781
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!