
[Feature] Kuberay Operator TPU Worker Group Support

Open · ryanaoleary opened this issue 1 year ago · 1 comment

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

In order to support multi-host TPUs with Kuberay, the TPU_WORKER_ID and TPU_WORKER_HOSTNAMES environment variables must be initialized within each TPU worker. To do so, a mutating admission webhook [1] can be used to inject these variables. The TPU_WORKER_HOSTNAMES variable is a joined list of individually addressable DNS names for each TPU worker, and it relies on a backing headless service to provide these hostnames. Each TPU worker group needs its own headless service so that the per-pod subdomains can be created. It would be useful for the Kuberay operator to manage creating and destroying headless services for worker groups requesting "google.com/tpu" resources. Any input on how this could best be accomplished would be appreciated; the initial idea is to add a BuildHeadlessServiceForTPUs() function in ray-operator/controllers/ray/common/service.go that returns a headless service for each worker group requesting TPU resources.
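For illustration, here is a minimal sketch of what such a helper might look like. The function name comes from this proposal, but the per-group naming scheme, the `ray.io/cluster` and `ray.io/group` selector labels, and the `rayv1` import path are assumptions and would need to match whatever conventions the operator already uses in `service.go`:

```go
package common

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// Assumed import path for the RayCluster API types.
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// BuildHeadlessServiceForTPUs returns a headless Service scoped to a single
// TPU worker group, giving each worker pod a stable, individually addressable
// DNS name under the service subdomain (sketch only).
func BuildHeadlessServiceForTPUs(cluster rayv1.RayCluster, workerGroup rayv1.WorkerGroupSpec) *corev1.Service {
	// Assumed selector labels; these should match the labels the operator
	// already applies to worker pods.
	labels := map[string]string{
		"ray.io/cluster": cluster.Name,
		"ray.io/group":   workerGroup.GroupName,
	}
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			// Hypothetical naming scheme: one headless service per worker group.
			Name:      fmt.Sprintf("%s-%s-headless", cluster.Name, workerGroup.GroupName),
			Namespace: cluster.Namespace,
			Labels:    labels,
		},
		Spec: corev1.ServiceSpec{
			// ClusterIP "None" makes the service headless, so DNS returns
			// per-pod records instead of a single virtual IP.
			ClusterIP: corev1.ClusterIPNone,
			Selector:  labels,
			// Lets TPU workers resolve each other's hostnames before every
			// pod in the slice has become ready.
			PublishNotReadyAddresses: true,
		},
	}
}
```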

Additionally, to support TPU worker groups, pods within the same worker group should be scheduled with affinity on the same pod slice. This can be accomplished by deriving a topology key from the GKE node pool name and scheduling all pods within a worker group on the corresponding TPU node pool.
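As a rough sketch of that scheduling idea, the affinity below co-locates all pods of a worker group within one GKE node pool by using the standard `cloud.google.com/gke-nodepool` node label as the topology key. The helper name and the `ray.io/cluster` / `ray.io/group` labels are assumptions for illustration, not existing operator code:

```go
package common

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// tpuPodAffinity (hypothetical helper) builds pod affinity that requires all
// workers of a given group to schedule onto nodes sharing the same value of
// the GKE node pool label, i.e. onto the same TPU node pool / pod slice.
func tpuPodAffinity(clusterName, groupName string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAffinity: &corev1.PodAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{
							"ray.io/cluster": clusterName,
							"ray.io/group":   groupName,
						},
					},
					// Treat each GKE node pool as one topology domain.
					TopologyKey: "cloud.google.com/gke-nodepool",
				},
			},
		},
	}
}
```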

[1] https://github.com/GoogleCloudPlatform/ai-on-gke/issues/114

Use case

No response

Related issues

https://github.com/ray-project/ray/issues/39781

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

ryanaoleary · Nov 13 '23 20:11