refactor: improve error handling with client.IgnoreNotFound in controllers
What type of PR is this?
/kind bug
What this PR does / why we need it:
follow-up: https://github.com/kubernetes-sigs/kueue/pull/5786#issuecomment-3012681011
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
NONE
Hi @xigang. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Deploy Preview for kubernetes-sigs-kueue canceled.
| Name | Link |
|---|---|
| Latest commit | 51f2814c041eca29348945573cbff9db7b93a0f6 |
| Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6862aa6fc8b9f200083d4d9b |
/cc @mbobrovskyi
PTAL.
/ok-to-test
Also maybe here:
https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/admissioncheck.go#L128
https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/provisioning/admissioncheck_reconciler.go#L61
https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/admissioncheck.go#L124
https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/multikueuecluster.go#L425
Also maybe here:
https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/admissioncheck.go#L128
https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/provisioning/admissioncheck_reconciler.go#L61
https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/admissioncheck.go#L124
https://github.com/kubernetes-sigs/kueue/blob/89f1c037aab465d5d8faf3b0668e5c7cf10c8b46/pkg/controller/admissionchecks/multikueue/multikueuecluster.go#L425
Done.
What if the cache still has non-existent objects? Did you check if there is no any path to have outdated cache information?
Here, we check whether the object exists when sending an Update request to the APIServer using client.IgnoreNotFound, and we do not use data from the informer cache.
@mbobrovskyi PTAL. Thanks!
Here, we check whether the object exists when sending an Update request to the APIServer using client.IgnoreNotFound, and we do not use data from the informer cache.
I meant Kueue cache mechanism instead of informer cache.
@tenzen-y The Kueue cache can indeed run into stale data issues. What I’m thinking is to ensure eventual consistency by handling Delete events in the Informer’s EventHandler. Alternatively, we could proactively remove the object from the Kueue cache when a NotFound error occurs.
Any thoughts on this?
@mbobrovskyi Fixed, please take another look.😄
Wow. Thank you for the quick fix!
Overall LGTM. Just one point.
I think we should add an IgnoreNotFound for this:
https://github.com/kubernetes-sigs/kueue/blob/0cd304dfaf87dc672b7aef79f72816de07281c65/pkg/controller/core/workload_controller.go#L339-L358
But it should be placed outside the function, not inside. WDYT?
@tenzen-y The Kueue cache can indeed run into stale data issues. What I’m thinking is to ensure eventual consistency by handling Delete events in the Informer’s EventHandler. Alternatively, we could proactively remove the object from the Kueue cache when a NotFound error occurs.
Any thoughts on this?
As I checked the controller-runtime implementations, the reconcile requests are directly pushed to the workqueue w/o predicate and event handlers when reconciliation failed: https://github.com/kubernetes-sigs/controller-runtime/blob/cacd627d0b9bc46b592cf5f96f57f1308313618e/pkg/internal/controller/controller.go#L345-L347
So, NVM my above comment.
But it should be placed outside the function, not inside. WDYT?
@mbobrovskyi I personally feel we might not want to expose IgnoreNotFound externally. It might be cleaner and more appropriate to keep it together with ApplyAdmissionStatus. What do you think?
@tenzen-y That's right — if the reconcile fails, the key will be requeued into the workqueue.
@mbobrovskyi I personally feel we might not want to expose IgnoreNotFound externally. It might be cleaner and more appropriate to keep it together with ApplyAdmissionStatus. What do you think?
In that case, we would need to duplicate the check multiple times within the function. The tricky part is that we can't easily distinguish whether it's a genuine "not found" error or just a case where the object isn't returned but no error is raised. As a result, in Reconcile(), we might proceed to the next step even if it's actually a NotFound error.
Could you please also close resolved comments? It is very hard to understand which one is Done.
Could you please also close resolved comments? It is very hard to understand which one is Done.
All resolved comments have been closed.
I think we should add an IgnoreNotFound for this:
https://github.com/kubernetes-sigs/kueue/blob/0cd304dfaf87dc672b7aef79f72816de07281c65/pkg/controller/core/workload_controller.go#L339-L358
But it should be placed outside the function, not inside. WDYT?
@mbobrovskyi Fixed. Please take another look.
LGTM label has been added.
@mimowo @tenzen-y For approve. Thanks!
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mbobrovskyi, mimowo, xigang
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [mimowo]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@tenzen-y I'll check it now.
@tenzen-y Thank you for the thorough review! @xigang Could you please create a follow-up PR to address this?