Upgrade RayJob API to v1
What this PR does / why we need it:
KubeRay has promoted the RayJob API from v1alpha1 to v1 with KubeRay v1.0.0. While the v1alpha1 version is still served in KubeRay v1.1.0, it'll be removed in v1.2.0.
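For anyone checking a cluster before upgrading, a minimal sketch of how to see which RayJob API versions the installed KubeRay CRD declares (assuming the standard `rayjobs.ray.io` CRD name):

```bash
# List the API versions declared by the RayJob CRD; with KubeRay v1.0.0+
# the list should include "v1".
kubectl get crd rayjobs.ray.io -o jsonpath='{.spec.versions[*].name}'
```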
Which issue(s) this PR fixes:
Fixes #1335.
Special notes for your reviewer:
Supersedes #1435.
Does this PR introduce a user-facing change?
Upgrade RayJob API to v1
⚠️ If you use a KubeRay version older than v1.0.0, you'll have to upgrade your existing installation
to the most recent released version for it to remain compatible with Kueue.
Deploy Preview for kubernetes-sigs-kueue ready!
| Name | Link |
|---|---|
| Latest commit | 379e4ba1b957d8e56842a260fa5f69c9d25a4a9a |
| Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/660c39ecf54efc0008ae9628 |
| Deploy Preview | https://deploy-preview-1802--kubernetes-sigs-kueue.netlify.app |
Hey, apologies if I'm jumping the gun here, but I'm currently trying to get v1 RayJobs to work with Kueue - I pulled down your PR changes, and installed Kueue through helm, but I'm getting an error whenever I try to create a new RayJob now:
Error from server (InternalError): error when creating "ray-job.sample.yaml": Internal error occurred: failed calling webhook "mrayjob.kb.io": failed to call webhook: the server could not find the requested resource
Looking at the Kueue controller logs, it seems like the mutating (and I assume validating) webhook is still for v1alpha1 RayJobs:
{"level":"info","ts":"2024-03-12T22:31:38.10128149Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:158","msg":"Registering a mutating webhook","GVK":"ray.io/v1alpha1, Kind=RayJob","path":"/mutate-ray-io-v1alpha1-rayjob"}
Am I doing something wrong here? Possibly installing Kueue in the wrong way to utilize these changes?
> Hey, apologies if I'm jumping the gun here, but I'm currently trying to get v1 RayJobs to work with Kueue - I pulled down your PR changes, and installed Kueue through helm, but I'm getting an error whenever I try to create a new RayJob now:
No pb at all :) Thanks a lot for testing this!
> Looking at the Kueue controller logs, it seems like the mutating (and I assume validating) webhook is still for v1alpha1 RayJobs:
> {"level":"info","ts":"2024-03-12T22:31:38.10128149Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:158","msg":"Registering a mutating webhook","GVK":"ray.io/v1alpha1, Kind=RayJob","path":"/mutate-ray-io-v1alpha1-rayjob"}
> Am I doing something wrong here? Possibly installing Kueue in the wrong way to utilize these changes?
That "GVK":"ray.io/v1alpha1, Kind=RayJob"
really looks like it should not be there if the changes from that PR would be taken into account, specifically: https://github.com/kubernetes-sigs/kueue/pull/1802/files#diff-24b8d3ba7104004ba575becf8db0a4ee47c9062f716a744f1afbbdb7bfded0d9R50
How exactly it is you've tested it? Could you double check the deployment uses the container image that's built with the changes from that PR?
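A couple of quick checks can help here (a sketch; the `kueue-system` namespace and `kueue-controller-manager` deployment name are the kubebuilder-style defaults and may differ in your installation):

```bash
# Check which image the Kueue controller is actually running.
kubectl -n kueue-system get deployment kueue-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[*].image}'

# Check which RayJob API version the mutating webhook is registered for.
kubectl get mutatingwebhookconfigurations -o yaml | grep -B2 -A4 'ray.io'
```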
Ah whoops, realizing now that just running `helm install ...` after pulling down your changes wasn't enough.
I ran `make install deploy`, but now I'm getting the following error:
Warning Failed 14s (x2 over 29s) kubelet Failed to pull image "gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335": rpc error: code = NotFound desc = failed to pull and unpack image "gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335": failed to resolve reference "gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335": gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335: not found
I'm guessing I should build the new image locally and use that? Are there instructions for testing anywhere? Is `make install deploy` enough to try out the changes once I get the image built?
Oof nevermind, sorry just found this: https://kueue.sigs.k8s.io/docs/installation/#build-and-install-from-source
I'll try this again
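If you're testing on a kind cluster, a hedged sketch of getting a locally built image into the cluster so the deployment doesn't have to pull it from a registry (the image tag below is just an example):

```bash
# Assuming the Kueue image was built locally and tagged example.com/kueue:dev,
# load it into the kind cluster nodes so no remote pull is needed.
kind load docker-image example.com/kueue:dev
```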
Ok, I built & installed from source and confirmed I'm now getting the v1 RayJob mutating & validating webhooks. I `kubectl apply -f`'d a RayJob, and the RayJob gets created, and so does the workload, but nothing else. And I'm seeing this in the kueue-controller logs:
{"level":"error","ts":"2024-03-13T17:27:04.793883358Z","logger":"cert-rotation","caller":"rotator/rotator.go:322","msg":"could not refresh CA and server certs","error":"Operation cannot be fulfilled on secrets \"kueue-webhook-server-cert\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded.func1\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:322\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:461\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:350\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:278\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:223"}
> Ok, I built & installed from source and confirmed I'm now getting the v1 RayJob mutating & validating webhooks. I `kubectl apply -f`'d a RayJob, and the RayJob gets created, and so does the workload, but nothing else. And I'm seeing this in the kueue-controller logs: `{"level":"error","ts":"2024-03-13T17:27:04.793883358Z","logger":"cert-rotation","caller":"rotator/rotator.go:322","msg":"could not refresh CA and server certs","error":"Operation cannot be fulfilled on secrets \"kueue-webhook-server-cert\": the object has been modified; please apply your changes to the latest version and try again", ...}`
That error message happens when optimistic locking fails during an update, but the operation is simply retried so it's not an issue.
For the RayJob and the associated workload resource, checking their statuses should provide the reason why it's not started / running.
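Concretely, something like the following (a sketch; the resource names are taken from the sample used above):

```bash
# Inspect the RayJob and the Workload that Kueue created for it; their
# status conditions usually explain why the job has not started.
kubectl describe rayjob rayjob-sample
kubectl get workloads
kubectl describe workload rayjob-rayjob-sample-8f26e
```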
RayJob is stuck in `Initializing` status:
```
Status:
  Job Deployment Status:  Initializing
  Job Id:                 rayjob-sample-mpmnf
  Ray Cluster Name:       rayjob-sample-raycluster-c2hgb
  Ray Cluster Status:
    Desired CPU:     0
    Desired GPU:     0
    Desired Memory:  0
    Desired TPU:     0
    Head:
  Start Time:             2024-03-13T19:43:01Z
```
Workload status doesn't show anything out of the ordinary:
```
Conditions:
  Last Transition Time:  2024-03-13T19:43:01Z
  Message:               Quota reserved in ClusterQueue cluster-queue
  Reason:                QuotaReserved
  Status:                True
  Type:                  QuotaReserved
  Last Transition Time:  2024-03-13T19:43:01Z
  Message:               The workload is admitted
  Reason:                Admitted
  Status:                True
  Type:                  Admitted
```
RayCluster is showing as suspended, though:
```
Normal  Deleted  4m (x4 over 9m)  raycluster-controller  Deleted Pods for RayCluster default/rayjob-sample-raycluster-c2hgb due to suspension
```
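For context, Kueue admits a RayJob by flipping its `spec.suspend` field, so a quick way to confirm whether the job is still suspended from Kueue's side (a sketch, using the sample names above):

```bash
# "true" means Kueue has not (or not yet) unsuspended the RayJob.
kubectl get rayjob rayjob-sample -o jsonpath='{.spec.suspend}'
```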
Also seeing these errors in the kueue-controller logs (in addition to that other one above):
{"level":"error","ts":"2024-03-13T19:41:54.423509864Z","caller":"jobframework/reconciler.go:202","msg":"Removing finalizer","controller":"rayjob","controllerGroup":"ray.io","controllerKind":"RayJob","RayJob":{"name":"rayjob-sample","namespace":"default"},"namespace":"default","name":"rayjob-sample","reconcileID":"c3bc5a6e-4af2-46a9-9656-939200d808b5","job":"default/rayjob-sample","gvk":"ray.io/v1, Kind=RayJob","error":"Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"rayjob-rayjob-sample-8f26e\": StorageError: invalid object, Code: 4, Key: /registry/kueue.x-k8s.io/workloads/default/rayjob-rayjob-sample-8f26e, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4c144f8e-b609-4495-84c1-3a71a73dc4d6, UID in object meta: ","stacktrace":"sigs.k8s.io/kueue/pkg/controller/jobframework.(*JobReconciler).ReconcileGenericJob\n\t/workspace/pkg/controller/jobframework/reconciler.go:202\nsigs.k8s.io/kueue/pkg/controller/jobframework.(*genericReconciler).Reconcile\n\t/workspace/pkg/controller/jobframework/reconciler.go:1003\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-03-13T19:41:54.423578204Z","caller":"controller/controller.go:329","msg":"Reconciler error","controller":"rayjob","controllerGroup":"ray.io","controllerKind":"RayJob","RayJob":{"name":"rayjob-sample","namespace":"default"},"namespace":"default","name":"rayjob-sample","reconcileID":"c3bc5a6e-4af2-46a9-9656-939200d808b5","error":"Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"rayjob-rayjob-sample-8f26e\": StorageError: invalid object, Code: 4, Key: /registry/kueue.x-k8s.io/workloads/default/rayjob-rayjob-sample-8f26e, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4c144f8e-b609-4495-84c1-3a71a73dc4d6, UID in object meta: ","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
@mbzomowski I've been able to reproduce it. That issue is actually unrelated to this PR and should be fixed by #1844.
Rebased with the new changes & everything works! Thanks so much @astefanutti!
@mbzomowski great, thanks for the feedback, and the thorough testing!
LGTM label has been added.
@alculquicondor @tenzen-y @andrewsykim @kevin85421 Do we want to get this in the next Kueue minor release, now that KubeRay v1.1.0 has been released?
> @alculquicondor @tenzen-y @andrewsykim @kevin85421 Do we want to get this in the next Kueue minor release, now that KubeRay v1.1.0 has been released?
Based on this discussion: https://github.com/kubernetes-sigs/kueue/pull/1435#issuecomment-1878911426, I agree with migrating to the KubeRay v1 API in Kueue v0.7.
sgtm, please rebase
And put `ACTION REQUIRED` in the release notes to ask users to upgrade to a version of KubeRay that supports v1 (1.0?).
PR rebased, and release notes updated. PTAL, thanks.
@astefanutti Could you address CI errors?
/retest
@tenzen-y retesting as I think these were transient failures.
LGTM label has been added.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: astefanutti, tenzen-y
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [tenzen-y]
Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
@astefanutti Could you update documentation and examples in a follow-up?
- https://github.com/kubernetes-sigs/kueue/blob/main/site/content/en/docs/tasks/run/rayjobs.md
- https://github.com/kubernetes-sigs/kueue/blob/main/site/static/examples/jobs/ray-job-sample.yaml
@tenzen-y as far as I can see, the documentation and examples have already been updated within the scope of this PR, unless you have specific changes in mind?
On the other hand, while giving it another look, I've found another follow-up, which is covered by #1941.
Sorry I'm late to review this, @astefanutti do we also want to update the go module version of kuberay? I ran into some corner cases where webhook logic drops fields because the imported type in Kueue doesn't have all the needed fields.
We should update to v1.1, but make sure it's backwards compatible with a cluster on v1.0
Oh ignore me, looks like it's using v1.1 already :)
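For reference, a quick way to confirm which KubeRay module version Kueue pins (a minimal sketch, run from a Kueue checkout; the module path is the one Kueue imports):

```bash
# Print the resolved version of the KubeRay operator API Go module.
go list -m github.com/ray-project/kuberay/ray-operator
```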
> @astefanutti Could you update documentation and examples in a follow-up?
> - https://github.com/kubernetes-sigs/kueue/blob/main/site/content/en/docs/tasks/run/rayjobs.md
> - https://github.com/kubernetes-sigs/kueue/blob/main/site/static/examples/jobs/ray-job-sample.yaml
>
> @tenzen-y as far as I can see, the documentation and examples have already been updated within the scope of this PR, unless you have specific changes in mind?
> On the other hand, while giving it another look, I've found another follow-up, which is covered by #1941.
Oops, you're right. NVM
> Oops, you're right. NVM
No pb, just wanted to make sure I was not missing anything!