
Can patchHelper provide an option to verify resourceVersion?

Open · huaqing1994 opened this issue 2 years ago · 10 comments

As a developer, I use the patchHelper in our controllers to patch our CRs at the end of reconcile(), like CAPI does.

We found a small problem: patchHelper uses client.MergeFrom() in Helper.patch() and Helper.patchStatus(), so the apiserver does not verify the resourceVersion when patching. Sometimes the controller reconciles stale data (because the controller's cache has not been updated in time), and fields (like a timestamp or a random ID) written by the previous reconcile() get overwritten by the current reconcile().

We want to avoid overwriting fields when reconciling stale data, and we don't need different controllers to act on conditions of the same object. Could patchHelper provide an option to verify the resourceVersion and refuse to patch when the resourceVersion check fails?
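For context, a minimal sketch of the pattern in question (the Reconciler type here is only a placeholder, and clusterv1.Cluster stands in for the custom resource being reconciled):

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/patch"
)

// Reconciler is a placeholder for the controller described above.
type Reconciler struct {
	Client client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &clusterv1.Cluster{}
	if err := r.Client.Get(ctx, req.NamespacedName, obj); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// The helper snapshots the object as it was read from the (possibly stale) cache.
	patchHelper, err := patch.NewHelper(obj, r.Client)
	if err != nil {
		return ctrl.Result{}, err
	}

	// ... mutate obj.Spec / obj.Status during reconciliation ...

	// Helper.Patch builds merge patches with client.MergeFrom, so the apiserver
	// applies them without checking resourceVersion; a patch computed from stale
	// cached data can silently overwrite fields written by a newer reconcile.
	if err := patchHelper.Patch(ctx, obj); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```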

huaqing1994 avatar Mar 29 '23 06:03 huaqing1994

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 29 '23 06:03 k8s-ci-robot

/triage needs-information

I'm a little bit concerned about adding options that we are not using in CAPI, for a couple of practical reasons. But I would also like to better understand the use cases, because at first sight this seems to be a symptom of the controller relying on something set in a previous reconcile, which is generally a bad practice.

fabriziopandini avatar Apr 14 '23 16:04 fabriziopandini

@fabriziopandini I understand your concerns.

Another specific case we have: the CR waits for an asynchronous task to complete, and the task ID is recorded in status.taskID. Every reconcile() checks whether the task has completed and clears status.taskID if it has. Sometimes task B starts immediately after task A finishes. After task B's ID has been written to status.taskID, the next reconcile() may still read task A's ID from the stale cache; since task A has completed, it clears status.taskID. But the value actually deleted on the apiserver is task B's ID, so task B is lost.

So it would be nice if the CAPI patch helper could verify the resourceVersion.
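To make the race concrete, a rough sketch (all names here, like WidgetStatus, TaskID, and isTaskDone, are hypothetical stand-ins for our controller, not CAPI APIs):

```go
package controllers

import "context"

// WidgetStatus is a hypothetical status carrying the in-flight async task ID.
type WidgetStatus struct {
	TaskID string `json:"taskID,omitempty"`
}

// isTaskDone is a stand-in for querying the external task system.
func isTaskDone(ctx context.Context, taskID string) (bool, error) { return true, nil }

func reconcileTask(ctx context.Context, status *WidgetStatus) error {
	if status.TaskID == "" {
		return nil
	}
	done, err := isTaskDone(ctx, status.TaskID)
	if err != nil {
		return err
	}
	if done {
		// If status was read from a stale cache it may still hold task A's ID,
		// even though task B's ID has already been written to the apiserver.
		// Clearing the field and patching with client.MergeFrom (no
		// resourceVersion check) then wipes out task B's ID as well; a patch
		// that verified resourceVersion would instead fail with a conflict.
		status.TaskID = ""
	}
	return nil
}
```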

huaqing1994 avatar Apr 19 '23 04:04 huaqing1994

This is already possible in controller-runtime by using https://github.com/kubernetes-sigs/controller-runtime/blob/3d915df2799fcacd28e679bf1df4a59a957644c9/pkg/client/patch.go#L57-L71 (which the condition patches already force).
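For reference, a minimal sketch of using that option directly with a controller-runtime client (clusterv1.Cluster stands in for any object; patchWithOptimisticLock is just an illustrative helper name):

```go
package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func patchWithOptimisticLock(ctx context.Context, c client.Client, obj *clusterv1.Cluster, mutate func(*clusterv1.Cluster)) error {
	// Snapshot the object before mutating it.
	before := obj.DeepCopy()
	mutate(obj)

	// MergeFromWithOptimisticLock adds metadata.resourceVersion to the generated
	// merge patch, so the apiserver rejects the patch with a conflict error if
	// the object has changed since it was read.
	return c.Patch(ctx, obj, client.MergeFromWithOptions(before, client.MergeFromWithOptimisticLock{}))
}
```

Status can be patched the same way through c.Status().Patch with a fresh snapshot; on a conflict the reconcile can simply return the error and retry against a re-read object.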

vincepri avatar Apr 19 '23 14:04 vincepri

Hi @fabriziopandini, could we add a MergeFromWithOptimisticLock option to the CAPI patch helper?

jessehu avatar May 21 '23 03:05 jessehu

Does the solution suggested by @vincepri work? It would let you address the use case using controller-runtime's Patch directly, without CAPI implementing an option we don't use ourselves (and would struggle to prevent regressions on).

But I'm also open to other options if there is consensus on how to address the above (or if this is not perceived as a concern).

fabriziopandini avatar May 22 '23 17:05 fabriziopandini

Just saw https://github.com/kubernetes-sigs/cluster-api/issues/8706, which is somewhat related.

fabriziopandini avatar May 22 '23 17:05 fabriziopandini

https://github.com/kubernetes-sigs/cluster-api/issues/8706 is exactly what we meant. We hit bugs in the described scenario when using the CAPI patchHelper.

jessehu avatar May 26 '23 14:05 jessehu

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 21 '24 09:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 20 '24 09:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 21 '24 10:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 21 '24 10:03 k8s-ci-robot