
Upgrade RayJob API to v1

Open astefanutti opened this issue 11 months ago • 12 comments

What this PR does / why we need it:

KubeRay has promoted the RayJob API from v1alpha1 to v1 with KubeRay v1.0.0. While the v1alpha1 version is still served in KubeRay v1.1.0, it'll be removed in v1.2.0.
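For users, the visible change is the `apiVersion` field on RayJob manifests moving from `ray.io/v1alpha1` to `ray.io/v1`. A hypothetical sketch of that migration (the helper name is illustrative, not part of Kueue or KubeRay; real migrations should also confirm any schema differences between the two versions):

```python
def bump_rayjob_api_version(manifest: dict) -> dict:
    """Return a copy of a RayJob manifest targeting the v1 API.

    Group and kind stay the same; only the served version moves
    from v1alpha1 to v1.
    """
    if manifest.get("kind") != "RayJob":
        return manifest
    out = dict(manifest)  # shallow copy; leaves the input untouched
    if out.get("apiVersion") == "ray.io/v1alpha1":
        out["apiVersion"] = "ray.io/v1"
    return out

rayjob = {
    "apiVersion": "ray.io/v1alpha1",
    "kind": "RayJob",
    "metadata": {"name": "rayjob-sample"},
}
print(bump_rayjob_api_version(rayjob)["apiVersion"])  # ray.io/v1
```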

Which issue(s) this PR fixes:

Fixes #1335.

Special notes for your reviewer:

Supersedes #1435.

Does this PR introduce a user-facing change?

Upgrade RayJob API to v1

⚠️ If you use a KubeRay version older than v1.0.0, you'll have to upgrade your existing installation
to the most recent released version for it to remain compatible with Kueue.

astefanutti avatar Mar 05 '24 09:03 astefanutti

Deploy Preview for kubernetes-sigs-kueue ready!

Latest commit: 379e4ba1b957d8e56842a260fa5f69c9d25a4a9a
Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/660c39ecf54efc0008ae9628
Deploy Preview: https://deploy-preview-1802--kubernetes-sigs-kueue.netlify.app

netlify[bot] avatar Mar 05 '24 09:03 netlify[bot]

Hey, apologies if I'm jumping the gun here, but I'm currently trying to get v1 RayJobs to work with Kueue. I pulled down your PR changes and installed Kueue through Helm, but I'm getting an error whenever I try to create a new RayJob:

Error from server (InternalError): error when creating "ray-job.sample.yaml": Internal error occurred: failed calling webhook "mrayjob.kb.io": failed to call webhook: the server could not find the requested resource

Looking at the Kueue controller logs, it seems like the mutating (and I assume validating) webhook is still for v1alpha1 RayJobs:

{"level":"info","ts":"2024-03-12T22:31:38.10128149Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:158","msg":"Registering a mutating webhook","GVK":"ray.io/v1alpha1, Kind=RayJob","path":"/mutate-ray-io-v1alpha1-rayjob"}

Am I doing something wrong here? Possibly installing Kueue in the wrong way to utilize these changes?

mbzomowski avatar Mar 12 '24 22:03 mbzomowski

Hey, apologies if I'm jumping the gun here, but I'm currently trying to get v1 RayJobs to work with Kueue. I pulled down your PR changes and installed Kueue through Helm, but I'm getting an error whenever I try to create a new RayJob:

No pb at all :) Thanks a lot for testing this!

Looking at the Kueue controller logs, it seems like the mutating (and I assume validating) webhook is still for v1alpha1 RayJobs:

{"level":"info","ts":"2024-03-12T22:31:38.10128149Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:158","msg":"Registering a mutating webhook","GVK":"ray.io/v1alpha1, Kind=RayJob","path":"/mutate-ray-io-v1alpha1-rayjob"}

Am I doing something wrong here? Possibly installing Kueue in the wrong way to utilize these changes?

That "GVK":"ray.io/v1alpha1, Kind=RayJob" really should not be there if the changes from this PR had been taken into account, specifically: https://github.com/kubernetes-sigs/kueue/pull/1802/files#diff-24b8d3ba7104004ba575becf8db0a4ee47c9062f716a744f1afbbdb7bfded0d9R50
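For reference, the path in that log line follows controller-runtime's default webhook path convention (`builder/webhook.go`): the prefix, then the group with dots replaced by dashes, the version, and the lowercased kind. So the served GVK is visible directly in the registered path. A small illustrative sketch:

```python
def default_webhook_path(prefix: str, group: str, version: str, kind: str) -> str:
    """Mimic controller-runtime's default webhook path:
    /<prefix>-<group with '.' as '-'>-<version>-<lowercased kind>."""
    return f"/{prefix}-{group.replace('.', '-')}-{version}-{kind.lower()}"

# The v1alpha1 registration seen in the log above:
print(default_webhook_path("mutate", "ray.io", "v1alpha1", "RayJob"))
# /mutate-ray-io-v1alpha1-rayjob

# What a build with this PR's changes should register instead:
print(default_webhook_path("mutate", "ray.io", "v1", "RayJob"))
# /mutate-ray-io-v1-rayjob
```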

How exactly did you test it? Could you double-check that the deployment uses a container image built with the changes from this PR?

astefanutti avatar Mar 13 '24 09:03 astefanutti

Ah whoops - realizing now that just running helm install ... after pulling down your changes wasn't enough.

I ran make install deploy, but now I'm getting the following error:

  Warning  Failed     14s (x2 over 29s)  kubelet            Failed to pull image "gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335": rpc error: code = NotFound desc = failed to pull and unpack image "gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335": failed to resolve reference "gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335": gcr.io/k8s-staging-kueue/kueue:v0.7.0-devel-33-g0414b335: not found

I'm guessing I should build the new image locally and use that? Are there instructions for testing anywhere? Is make install deploy enough to try out the changes once I get the image built?

mbzomowski avatar Mar 13 '24 17:03 mbzomowski

Oof nevermind, sorry just found this: https://kueue.sigs.k8s.io/docs/installation/#build-and-install-from-source

I'll try this again

mbzomowski avatar Mar 13 '24 17:03 mbzomowski

Ok, I built & installed from source and confirmed I'm now getting the v1 RayJob mutating & validating admission webhooks. I kubectl apply -f'd a RayJob; the RayJob gets created, as does the workload, but nothing else. And I'm seeing this in the kueue-controller logs:

{"level":"error","ts":"2024-03-13T17:27:04.793883358Z","logger":"cert-rotation","caller":"rotator/rotator.go:322","msg":"could not refresh CA and server certs","error":"Operation cannot be fulfilled on secrets \"kueue-webhook-server-cert\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded.func1\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:322\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:461\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:350\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:278\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:223"}

mbzomowski avatar Mar 13 '24 17:03 mbzomowski

Ok, I built & installed from source and confirmed I'm now getting the v1 RayJob mutating & validating admission webhooks. I kubectl apply -f'd a RayJob; the RayJob gets created, as does the workload, but nothing else. And I'm seeing this in the kueue-controller logs:

{"level":"error","ts":"2024-03-13T17:27:04.793883358Z","logger":"cert-rotation","caller":"rotator/rotator.go:322","msg":"could not refresh CA and server certs","error":"Operation cannot be fulfilled on secrets \"kueue-webhook-server-cert\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded.func1\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:322\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:461\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).refreshCertIfNeeded\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:350\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/go/pkg/mod/github.com/open-policy-agent/[email protected]/pkg/rotator/rotator.go:278\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:223"}

That error message happens when optimistic locking fails during an update, but the operation is simply retried so it's not an issue.
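A minimal sketch of that retry-on-conflict pattern: writes carry the `resourceVersion` they read, the server rejects stale writes with a conflict, and the caller re-reads and retries (client-go's `retry.RetryOnConflict` is the real-world equivalent; the names below are hypothetical):

```python
class Conflict(Exception):
    """Raised when a write carries a stale resourceVersion."""

class Store:
    """Toy stand-in for the API server's optimistic locking on one object."""
    def __init__(self, data):
        self.data, self.rv = data, 1

    def get(self):
        return dict(self.data), self.rv

    def update(self, data, rv):
        if rv != self.rv:  # stale resourceVersion: reject the write
            raise Conflict("object has been modified; apply changes to the latest version")
        self.data, self.rv = data, self.rv + 1

def retry_on_conflict(store, mutate, attempts=5):
    for _ in range(attempts):
        data, rv = store.get()  # re-read the latest version each attempt
        try:
            store.update(mutate(data), rv)
            return
        except Conflict:
            continue  # a concurrent writer won; try again
    raise RuntimeError("out of retries")

store = Store({"cert": "old"})
retry_on_conflict(store, lambda d: {**d, "cert": "refreshed"})
print(store.data["cert"])  # refreshed
```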

For the RayJob and the associated workload resource, checking their statuses should reveal why it's not started or running.

astefanutti avatar Mar 13 '24 17:03 astefanutti

RayJob is stuck in Initializing status:

Status:
  Job Deployment Status:  Initializing
  Job Id:                 rayjob-sample-mpmnf
  Ray Cluster Name:       rayjob-sample-raycluster-c2hgb
  Ray Cluster Status:
    Desired CPU:     0
    Desired GPU:     0
    Desired Memory:  0
    Desired TPU:     0
    Head:
  Start Time:  2024-03-13T19:43:01Z

Workload status doesn't show anything out of the ordinary:

  Conditions:
    Last Transition Time:  2024-03-13T19:43:01Z
    Message:               Quota reserved in ClusterQueue cluster-queue
    Reason:                QuotaReserved
    Status:                True
    Type:                  QuotaReserved
    Last Transition Time:  2024-03-13T19:43:01Z
    Message:               The workload is admitted
    Reason:                Admitted
    Status:                True
    Type:                  Admitted
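As a side note, the conditions above are what "admitted" means programmatically: a condition of type `Admitted` with status `True`. A small hypothetical helper that checks a conditions list shaped like the status block above:

```python
def is_admitted(conditions: list) -> bool:
    """True if the workload has an Admitted condition with status True."""
    return any(
        c.get("type") == "Admitted" and c.get("status") == "True"
        for c in conditions
    )

conditions = [
    {"type": "QuotaReserved", "status": "True", "reason": "QuotaReserved"},
    {"type": "Admitted", "status": "True", "reason": "Admitted"},
]
print(is_admitted(conditions))  # True
```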

RayCluster is showing as suspended, though:

  Normal  Deleted  4m (x4 over 9m)  raycluster-controller  Deleted Pods for RayCluster default/rayjob-sample-raycluster-c2hgb due to suspension

Also seeing these errors in the kueue-controller logs (in addition to that other one above):

{"level":"error","ts":"2024-03-13T19:41:54.423509864Z","caller":"jobframework/reconciler.go:202","msg":"Removing finalizer","controller":"rayjob","controllerGroup":"ray.io","controllerKind":"RayJob","RayJob":{"name":"rayjob-sample","namespace":"default"},"namespace":"default","name":"rayjob-sample","reconcileID":"c3bc5a6e-4af2-46a9-9656-939200d808b5","job":"default/rayjob-sample","gvk":"ray.io/v1, Kind=RayJob","error":"Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"rayjob-rayjob-sample-8f26e\": StorageError: invalid object, Code: 4, Key: /registry/kueue.x-k8s.io/workloads/default/rayjob-rayjob-sample-8f26e, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4c144f8e-b609-4495-84c1-3a71a73dc4d6, UID in object meta: ","stacktrace":"sigs.k8s.io/kueue/pkg/controller/jobframework.(*JobReconciler).ReconcileGenericJob\n\t/workspace/pkg/controller/jobframework/reconciler.go:202\nsigs.k8s.io/kueue/pkg/controller/jobframework.(*genericReconciler).Reconcile\n\t/workspace/pkg/controller/jobframework/reconciler.go:1003\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2024-03-13T19:41:54.423578204Z","caller":"controller/controller.go:329","msg":"Reconciler error","controller":"rayjob","controllerGroup":"ray.io","controllerKind":"RayJob","RayJob":{"name":"rayjob-sample","namespace":"default"},"namespace":"default","name":"rayjob-sample","reconcileID":"c3bc5a6e-4af2-46a9-9656-939200d808b5","error":"Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"rayjob-rayjob-sample-8f26e\": StorageError: invalid object, Code: 4, Key: /registry/kueue.x-k8s.io/workloads/default/rayjob-rayjob-sample-8f26e, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4c144f8e-b609-4495-84c1-3a71a73dc4d6, UID in object meta: ","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
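The StorageError above says the update carried a UID precondition for an object whose stored UID no longer matches; the empty "UID in object meta:" suggests the workload was already gone. An illustrative sketch of that precondition check (hypothetical names, not the apiserver's actual code):

```python
class PreconditionFailed(Exception):
    """Raised when a write's UID precondition doesn't match the stored object."""

def update_with_uid_precondition(stored, expected_uid, new_obj):
    """Apply new_obj only if the stored object still has the expected UID.

    A deleted object yields an empty actual UID, so the precondition fails.
    """
    actual_uid = (stored or {}).get("uid", "")
    if actual_uid != expected_uid:
        raise PreconditionFailed(
            f"UID in precondition: {expected_uid}, UID in object meta: {actual_uid}"
        )
    return new_obj

try:
    update_with_uid_precondition(None, "4c144f8e-b609-4495-84c1-3a71a73dc4d6", {"spec": {}})
except PreconditionFailed as e:
    print(e)
    # UID in precondition: 4c144f8e-b609-4495-84c1-3a71a73dc4d6, UID in object meta:
```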

mbzomowski avatar Mar 13 '24 19:03 mbzomowski

@mbzomowski I've been able to reproduce. That issue is actually unrelated to that PR and should be fixed with #1844.

astefanutti avatar Mar 14 '24 17:03 astefanutti

Rebased with the new changes & everything works! Thanks so much @astefanutti!

mbzomowski avatar Mar 15 '24 17:03 mbzomowski

@mbzomowski great, thanks for the feedback, and the thorough testing!

astefanutti avatar Mar 18 '24 14:03 astefanutti

LGTM label has been added.

Git tree hash: eeb318d227895819982a137aae623f27907664e9

k8s-ci-robot avatar Mar 25 '24 08:03 k8s-ci-robot

@alculquicondor @tenzen-y @andrewsykim @kevin85421 Do we want to get this in the next Kueue minor release, now that KubeRay v1.1.0 has been released?

astefanutti avatar Mar 28 '24 13:03 astefanutti

@alculquicondor @tenzen-y @andrewsykim @kevin85421 Do we want to get this in the next Kueue minor release, now that KubeRay v1.1.0 has been released?

Based on this discussion: https://github.com/kubernetes-sigs/kueue/pull/1435#issuecomment-1878911426

I agree with migrating to the KubeRay v1 API in Kueue v0.7.

tenzen-y avatar Mar 28 '24 14:03 tenzen-y

sgtm, please rebase

And put ACTION REQUIRED in the release notes to ask users to upgrade to a KubeRay version that supports v1 (1.0?).

alculquicondor avatar Mar 28 '24 17:03 alculquicondor

PR rebased, and release notes updated. PTAL, thanks.

astefanutti avatar Mar 29 '24 12:03 astefanutti

@astefanutti Could you address CI errors?

tenzen-y avatar Apr 02 '24 16:04 tenzen-y

/retest

astefanutti avatar Apr 02 '24 16:04 astefanutti

@tenzen-y retesting as I think these were transient failures.

astefanutti avatar Apr 02 '24 16:04 astefanutti

LGTM label has been added.

Git tree hash: 261fce03970d3454a8a9b35fb6516b8050daff94

k8s-ci-robot avatar Apr 02 '24 17:04 k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Apr 02 '24 17:04 k8s-ci-robot

@astefanutti Could you update documentation and examples in a follow-up?

  • https://github.com/kubernetes-sigs/kueue/blob/main/site/content/en/docs/tasks/run/rayjobs.md
  • https://github.com/kubernetes-sigs/kueue/blob/main/site/static/examples/jobs/ray-job-sample.yaml

@tenzen-y as far as I can see, the documentation and examples have been updated in the scope of that PR already, unless you have specific changes in mind?

On the other hand, while giving it another look, I've found another follow-up that's covered by #1941.

astefanutti avatar Apr 03 '24 13:04 astefanutti

Sorry I'm late to review this. @astefanutti, do we also want to update the Go module version of KubeRay? I ran into some corner cases where webhook logic drops fields because the imported type in Kueue doesn't have all the needed fields.

andrewsykim avatar Apr 03 '24 14:04 andrewsykim

We should update to v1.1, but make sure it's backwards compatible with a cluster on v1.0

andrewsykim avatar Apr 03 '24 14:04 andrewsykim

Oh ignore me, looks like it's using v1.1 already :)

andrewsykim avatar Apr 03 '24 14:04 andrewsykim

@astefanutti Could you update documentation and examples in a follow-up?

  • https://github.com/kubernetes-sigs/kueue/blob/main/site/content/en/docs/tasks/run/rayjobs.md
  • https://github.com/kubernetes-sigs/kueue/blob/main/site/static/examples/jobs/ray-job-sample.yaml

@tenzen-y as far as I can see, the documentation and examples have been updated in the scope of that PR already, unless you have specific changes in mind?

On the other hand, while giving it another look, I've found another follow-up that's covered by #1941.

Oops, you're right. NVM

tenzen-y avatar Apr 03 '24 14:04 tenzen-y

Oops, you're right. NVM

No pb, just wanted to make sure I was not missing anything!

astefanutti avatar Apr 03 '24 17:04 astefanutti