Doc: Explain how to use a RayJob with Kueue and ProvisioningRequest despite GKE's single PodSet limitation

Open fg91 opened this issue 3 weeks ago • 10 comments

Description

The documentation explains how to run RayJobs with Kueue and queued provisioning on GKE. The documented manifests only work when the RayJob only has a head node but no workers. If one adds workers, GKE rejects the ProvisioningRequest because it only supports a single PodSet per request currently.

This PR documents how to circumvent this issue.

Related issues

Closes #57839

Additional information

Created a feature request in GKE's issue tracker to allow multiple PodSets: https://issuetracker.google.com/issues/452882313

fg91 avatar Nov 29 '25 08:11 fg91

https://github.com/ray-project/ray/pull/59068 seems to be an incorrect reference.

aslonnie avatar Dec 01 '25 17:12 aslonnie

> https://github.com/ray-project/ray/pull/59068 seems to be an incorrect reference.

Thanks for the catch, the ID came from the PR template. Fixed.

fg91 avatar Dec 01 '25 17:12 fg91

cc @andrewsykim to review if you have time, thank you!

Future-Outlier avatar Dec 02 '25 16:12 Future-Outlier

> The documented manifests only work when the RayJob only has a head node but no workers.

@fg91 I don't think this is true, assuming the Head pod doesn't request GPUs. But let me know if you see otherwise

andrewsykim avatar Dec 02 '25 20:12 andrewsykim

> > The documented manifests only work when the RayJob only has a head node but no workers.
>
> @fg91 I don't think this is true, assuming the Head pod doesn't request GPUs. But let me know if you see otherwise

When I configure a head node without a GPU and a worker with a GPU, I see the error message mentioned in the linked issue:


Error creating ProvisioningRequest "rayjob-rayjob-sleep-test-062a5-dws-prov-1": admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
      Violations details: {"[denied by provisioning-request-cr-validation]":["the queued provisioning feature currently supports only single PodSet per request"]}

fg91 avatar Dec 02 '25 22:12 fg91

@fg91 thanks, let me share this internally at Google and see if it's expected behavior. Can you share the output of your ProvisioningRequest?

kubectl get provisioningrequest rayjob-rayjob-sleep-test-062a5-dws-prov-1 -o yaml

andrewsykim avatar Dec 03 '25 16:12 andrewsykim

@fg91 @andrewsykim any progress here? I don't consider my comment as a blocker by any means.

mimowo avatar Dec 10 '25 10:12 mimowo

Andrew is on vacation

Future-Outlier avatar Dec 10 '25 14:12 Future-Outlier

@fg91 regarding this comment: https://github.com/ray-project/ray/pull/59070#issuecomment-3604268162. When using ProvisioningRequest on GKE, you should instead exclude CPU completely in the ProvisioningRequestConfig by setting managedResources to nvidia.com/gpu only, as shown here: https://kueue.sigs.k8s.io/docs/concepts/admission_check/provisioning_request/#provisioningrequestconfig
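
For illustration, a minimal sketch of such a ProvisioningRequestConfig could look like the following. The metadata name is hypothetical, and the apiVersion and provisioningClassName are assumptions based on the linked Kueue documentation and GKE queued provisioning; verify them against the Kueue version installed in your cluster.

```yaml
# Sketch only: the name is a placeholder, not taken from this PR.
apiVersion: kueue.x-k8s.io/v1beta1   # assumed API version; check your Kueue release
kind: ProvisioningRequestConfig
metadata:
  name: gpu-only-prov-config         # hypothetical name
spec:
  # Assumed provisioning class for GKE queued provisioning.
  provisioningClassName: queued-provisioning.gke.io
  # Only PodSets requesting a listed resource are included in the
  # ProvisioningRequest, so a CPU-only head PodSet is left out.
  managedResources:
  - nvidia.com/gpu
```

With a configuration along these lines, a RayJob whose head requests only CPU and whose workers request GPUs should produce a ProvisioningRequest containing just the worker PodSet.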

Also, the IdenticalWorkloadSchedulingRequirements feature is not meant for combining GPU and CPU PodSets. It is useful for combining PodSets that use the same resource types, for example when the "head" PodSet also uses GPUs.

cc @andrewsykim

mimowo avatar Dec 10 '25 17:12 mimowo