kueue icon indicating copy to clipboard operation
kueue copied to clipboard

refactor: improve error handling with client.IgnoreNotFound in controllers

Open xigang opened this issue 10 months ago • 7 comments

What type of PR is this?

/kind bug

What this PR does / why we need it:

follow-up: https://github.com/kubernetes-sigs/kueue/pull/5786#issuecomment-3012681011

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

xigang avatar Jun 28 '25 03:06 xigang

Hi @xigang. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 28 '25 03:06 k8s-ci-robot

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
Latest commit 51f2814c041eca29348945573cbff9db7b93a0f6
Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6862aa6fc8b9f200083d4d9b

netlify[bot] avatar Jun 28 '25 03:06 netlify[bot]

/cc @mbobrovskyi

PTAL.

xigang avatar Jun 28 '25 03:06 xigang

/ok-to-test

mbobrovskyi avatar Jun 28 '25 04:06 mbobrovskyi

Also maybe here:

https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/admissioncheck.go#L128

https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/provisioning/admissioncheck_reconciler.go#L61

https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/admissioncheck.go#L124

https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/multikueuecluster.go#L425

mbobrovskyi avatar Jun 28 '25 04:06 mbobrovskyi

Also maybe here:

https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/admissioncheck.go#L128

https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/provisioning/admissioncheck_reconciler.go#L61

https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/admissioncheck.go#L124

https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/multikueuecluster.go#L425

Done.

xigang avatar Jun 28 '25 09:06 xigang

What if the cache still has non-existent objects? Did you check if there is no any path to have outdated cache information?

Here, we check whether the object exists when sending an Update request to the APIServer using client.IgnoreNotFound, and we do not use data from the informer cache.

xigang avatar Jun 30 '25 06:06 xigang

@mbobrovskyi PTAL. Thanks!

xigang avatar Jun 30 '25 06:06 xigang

Here, we check whether the object exists when sending an Update request to the APIServer using client.IgnoreNotFound, and we do not use data from the informer cache.

I meant Kueue cache mechanism instead of informer cache.

tenzen-y avatar Jun 30 '25 07:06 tenzen-y

@tenzen-y The Kueue cache can indeed run into stale data issues. What I’m thinking is to ensure eventual consistency by handling Delete events in the Informer’s EventHandler. Alternatively, we could proactively remove the object from the Kueue cache when a NotFound error occurs.

Any thoughts on this?

xigang avatar Jun 30 '25 07:06 xigang

@mbobrovskyi Fixed, please take another look.😄

xigang avatar Jun 30 '25 10:06 xigang

Wow. Thank you for the quick fix!

Overall LGTM. Just one point.

I think we should add an IgnoreNotFound for this:

https://github.com/kubernetes-sigs/kueue/blob/0cd304dfaf87dc672b7aef79f72816de07281c65/pkg/controller/core/workload_controller.go#L339-L358

But it should be placed outside the function, not inside. WDYT?

mbobrovskyi avatar Jun 30 '25 11:06 mbobrovskyi

@tenzen-y The Kueue cache can indeed run into stale data issues. What I’m thinking is to ensure eventual consistency by handling Delete events in the Informer’s EventHandler. Alternatively, we could proactively remove the object from the Kueue cache when a NotFound error occurs.

Any thoughts on this?

As I checked the controller-runtime implementations, the reconcile requests are directly pushed to the workqueue w/o predicate and event handlers when reconciliation failed: https://github.com/kubernetes-sigs/controller-runtime/blob/cacd627d0b9bc46b592cf5f96f57f1308313618e/pkg/internal/controller/controller.go#L345-L347

So, NVM my above comment.

tenzen-y avatar Jun 30 '25 11:06 tenzen-y

But it should be placed outside the function, not inside. WDYT?

@mbobrovskyi I personally feel we might not want to expose IgnoreNotFound externally. It might be cleaner and more appropriate to keep it together with ApplyAdmissionStatus. What do you think?

xigang avatar Jun 30 '25 11:06 xigang

@tenzen-y That's right — if the reconcile fails, the key will be requeued into the workqueue.

xigang avatar Jun 30 '25 11:06 xigang

@mbobrovskyi I personally feel we might not want to expose IgnoreNotFound externally. It might be cleaner and more appropriate to keep it together with ApplyAdmissionStatus. What do you think?

In that case, we would need to duplicate the check multiple times within the function. The tricky part is that we can't easily distinguish whether it's a genuine "not found" error or just a case where the object isn't returned but no error is raised. As a result, in Reconcile(), we might proceed to the next step even if it's actually a NotFound error.

mbobrovskyi avatar Jun 30 '25 11:06 mbobrovskyi

Could you please also close resolved comments? It is very hard to understand which one is Done.

mbobrovskyi avatar Jun 30 '25 11:06 mbobrovskyi

Could you please also close resolved comments? It is very hard to understand which one is Done.

All resolved comments have been closed.

xigang avatar Jun 30 '25 11:06 xigang

I think we should add an IgnoreNotFound for this:

https://github.com/kubernetes-sigs/kueue/blob/0cd304dfaf87dc672b7aef79f72816de07281c65/pkg/controller/core/workload_controller.go#L339-L358

But it should be placed outside the function, not inside. WDYT?

@mbobrovskyi Fixed. Please take another look.

xigang avatar Jun 30 '25 15:06 xigang

LGTM label has been added.

Git tree hash: a04c20c1203eabafc961ff482948ee4ccaadb99b

k8s-ci-robot avatar Jun 30 '25 17:06 k8s-ci-robot

@mimowo @tenzen-y For approve. Thanks!

xigang avatar Jul 01 '25 00:07 xigang

/approve

mimowo avatar Jul 01 '25 05:07 mimowo

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mbobrovskyi, mimowo, xigang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jul 01 '25 05:07 k8s-ci-robot

@tenzen-y I'll check it now.

xigang avatar Jul 01 '25 14:07 xigang

@tenzen-y Thank you for the thorough review! @xigang Could you please create a follow-up PR to address this?

mbobrovskyi avatar Jul 01 '25 15:07 mbobrovskyi