Doc: Explain how to use a RayJob with Kueue and ProvisioningRequest despite GKE's single PodSet limitation
Description
The documentation explains how to run RayJobs with Kueue and queued provisioning on GKE. The documented manifests only work when the RayJob has a head node and no workers. If one adds workers, GKE rejects the ProvisioningRequest because queued provisioning currently supports only a single PodSet per request.
This PR documents how to circumvent this issue.
Related issues
Closes #57839
Additional information
Created a feature request in GKE's issue tracker to allow multiple PodSets per request: https://issuetracker.google.com/issues/452882313
https://github.com/ray-project/ray/pull/59068 seems to be an incorrect reference.
Thanks for the catch, the id was from the pr template. Fixed.
cc @andrewsykim to review if you have time, thank you!
The documented manifests only work when the RayJob only has a head node but no workers.
@fg91 I don't think this is true, assuming the Head pod doesn't request GPUs. But let me know if you see otherwise
When I configure a head node without a GPU and a worker with a GPU, I see the error message mentioned in the linked issue:
Error creating ProvisioningRequest "rayjob-rayjob-sleep-test-062a5-dws-prov-1": admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by provisioning-request-cr-validation]":["the queued provisioning feature currently supports only single PodSet per request"]}
@fg91 thanks, let me share this internally at Google and see if it's expected behavior. Can you share the output of your ProvisioningRequest?
kubectl get provisioningrequest rayjob-rayjob-sleep-test-062a5-dws-prov-1 -o yaml
@fg91 @andrewsykim any progress here? I don't consider my comment as a blocker by any means.
Andrew is on vacation
@fg91 regarding this comment https://github.com/ray-project/ray/pull/59070#issuecomment-3604268162: when using ProvisioningRequest on GKE, you should exclude CPU completely in the ProvisioningRequestConfig by setting managedResources to nvidia.com/gpu, as shown here: https://kueue.sigs.k8s.io/docs/concepts/admission_check/provisioning_request/#provisioningrequestconfig
Also, the IdenticalWorkloadSchedulingRequirements feature is not meant for combining GPU and CPU PodSets. It is useful for combining PodSets that use the same resource types, for example when the "head" PodSet also uses GPUs.
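For reference, a minimal sketch of what that configuration might look like. The resource name dws-config is hypothetical, the apiVersion may differ depending on the installed Kueue version, and queued-provisioning.gke.io is assumed to be the provisioning class used for GKE queued provisioning. It is not taken from the manifests in this PR.

```yaml
# Sketch only: the name and apiVersion are assumptions, adjust to your cluster.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: dws-config  # hypothetical name
spec:
  # Assumed provisioning class for GKE queued provisioning.
  provisioningClassName: queued-provisioning.gke.io
  # Only PodSets requesting at least one of these resources are included in
  # the ProvisioningRequest, so a CPU-only head PodSet is left out.
  managedResources:
  - nvidia.com/gpu
```

With a CPU-only head and a single GPU worker group, this should leave one PodSet in the ProvisioningRequest, which is what the GKE validation webhook expects.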
cc @andrewsykim