cluster-api
Improve patchHelper to allow atomically updating a set of fields
Today patchHelper calculates patches by comparing two objects (let's call them original & desired). The patch that is sent to the apiserver then only contains the delta.
This works fine as long as the original object is up-to-date. The problems start when the original object is outdated, e.g. because we read it from the controller-runtime cache and the object was stale. Please note that when patching status & spec, the patch helper does not use optimistic locking.
An example:

original KCP object (stale)
  status:
    replicas: 3
    readyReplicas: 3

desired KCP object
  status:
    replicas: 4
    readyReplicas: 3
=> patchHelper will send a patch that sets .status.replicas = 4. readyReplicas is not included because patchHelper couldn't see a delta.
But the actual KCP object was:
  status:
    replicas: 3
    readyReplicas: 2
So after patching we will end up with the following KCP object:
  status:
    replicas: 4
    readyReplicas: 2
At no point did KCP calculate this combination of counter fields, but we still end up with this state in the apiserver because the patch calculation was based on a stale object. This can happen with various values for the counter fields, and it gets problematic when other controllers read the KCP status and then make decisions based on it, e.g. they might think KCP has already been entirely upgraded if all replica fields are 3, but this might not actually be the case.
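A minimal sketch of how this plays out in code, using controller-runtime's client.MergeFrom as a stand-in for the two-object diff the helper performs internally (the actual patchHelper implementation differs in the details):

```go
package example

import (
	"context"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchStatusDeltaOnly illustrates the delta-only behaviour: snapshot the
// (possibly stale) object, mutate it, and send only the computed difference.
func patchStatusDeltaOnly(ctx context.Context, c client.Client, kcp *controlplanev1.KubeadmControlPlane) error {
	// Snapshot of the object as read from the cache; if the cache was stale,
	// every field of this snapshot may already be outdated.
	base := kcp.DeepCopy()

	// The reconciler updates the counters it believes have changed.
	kcp.Status.Replicas = 4
	kcp.Status.ReadyReplicas = 3 // identical to the stale snapshot

	// The resulting merge patch only contains {"status":{"replicas":4}}.
	// readyReplicas is equal in base and kcp, so it is not sent, and whatever
	// value the apiserver currently holds (possibly 2) is left untouched.
	return c.Status().Patch(ctx, kcp, client.MergeFrom(base))
}
```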
This might sound like a contrived example, but we actually had this problem and it's super hard to debug. The Cluster topology controller read the KCP status, determined KCP was already entirely upgraded, and then triggered worker upgrades, even though the KCP upgrade was still in progress.
One way to solve this would be to not read KCP from the cache and read it from the apiserver directly, but this is not great for scalability and puts more load on the kube-apiserver.
My proposal would be to add an option to patchHelper to ensure a certain set of fields is always part of the patch, if one of the fields changed. Based on the example above this option would be used to ensure replicas & readyReplicas are always patched together.
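As a rough sketch of what the opt-in could look like at the call site (patch.NewHelper and Helper.Patch exist today; the option name WithAtomicFieldSets is made up for illustration and does not exist in the patch package):

```go
package example

import (
	"context"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/patch"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchKCPCounters sketches the proposed opt-in; WithAtomicFieldSets is hypothetical.
func patchKCPCounters(ctx context.Context, c client.Client, kcp *controlplanev1.KubeadmControlPlane) error {
	helper, err := patch.NewHelper(kcp, c)
	if err != nil {
		return err
	}

	// Reconciliation updates the counters it observed.
	kcp.Status.Replicas = 4
	kcp.Status.ReadyReplicas = 3

	// Hypothetical option: if any field of the set changed compared to the
	// snapshot taken in NewHelper, all fields of the set are forced into the
	// patch, so replicas and readyReplicas can never be split across two
	// different apiserver states.
	return helper.Patch(ctx, kcp,
		patch.WithAtomicFieldSets([][]string{
			{"status.replicas", "status.readyReplicas"},
		}),
	)
}
```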
Somewhat related to ongoing work in https://github.com/kubernetes-sigs/cluster-api/issues/11105
@vincepri @fabriziopandini @chrischdi Opinions?
> One way to solve this would be to not read KCP from the cache and read it from the apiserver directly
Even this change would just minimize the chances of getting an outdated object, but not eliminate the problem, right? I think we would need to go with an optimistic locking approach, or your latter proposal of identifying which sets of changes must be atomic.
> to add an option to patchHelper to ensure a certain set of fields is always part of the patch, if one of the fields changed
Would this be an additional argument or arguments the caller needs to supply when calling patchHelper? Sort of an opt-in approach. Or would patchHelper have knowledge of correlated field requirements for CAPI objects?
> My proposal would be to add an option to patchHelper to ensure a certain set of fields is always part of the patch, if one of the fields changed. Based on the example above this option would be used to ensure replicas & readyReplicas are always patched together.
In some ways though it feels like we might want to enable optimistic locking? That kind of guarantees what you mention above, although it has other side effects.
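For reference, optimistic locking on a patch can already be expressed with controller-runtime's merge-from options; a minimal sketch, assuming the same stale-cache scenario as in the example above (whether and where the patch helper should do this internally is the open question):

```go
package example

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchStatusWithLock is the same delta patch as before, but the snapshot's
// resourceVersion is included, so a stale read surfaces as an error instead
// of silently merging onto newer state.
func patchStatusWithLock(ctx context.Context, c client.Client, kcp *controlplanev1.KubeadmControlPlane) error {
	base := kcp.DeepCopy()
	kcp.Status.Replicas = 4
	kcp.Status.ReadyReplicas = 3

	// MergeFromWithOptimisticLock adds resourceVersion to the patch body, so
	// the apiserver rejects it with a 409 Conflict if the object changed
	// since it was read from the cache.
	p := client.MergeFromWithOptions(base, client.MergeFromWithOptimisticLock{})
	err := c.Status().Patch(ctx, kcp, p)
	if apierrors.IsConflict(err) {
		// This is the side effect mentioned above: the caller has to re-read
		// the object and retry instead of the write succeeding on stale data.
	}
	return err
}
```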
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/assign sbueringer
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Deprioritize it with `/priority important-longterm` or `/priority backlog`
- Close this issue with `/close`
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Deprioritize it with `/priority important-longterm` or `/priority backlog`
- Close this issue with `/close`
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
What would be the downsides of always using optimistic locking from the patch helper? I guess there would be more requests to the API server, because requests that would otherwise succeed could now fail due to the optimistic locking. Any other downsides?
The patchHelper today does up to 3 individual write calls. If all of them are using optimistic locking, which resourceVersion would we use after the first call?
The status condition patch call also has custom merge logic that can succeed and make sense despite being applied based on a stale object.
The situation is pretty complex.
We have deployed a fork with optimistic locking in a CAPA environment that manages 4-5 clusters, with nodes coming and going at any given time, and we don't seem to see many 409s where the CAPI operator has to retry.
If you use optimistic locking on all 3 writes, which resourceVersion are you using for the 2nd and 3rd patch?
How often are you patching more than just one of: status / status.conditions / spec?
And how many controllers do you have reconciling the same object?
We are not changing the resourceVersion between the 3 patches, which means whenever one of them actually patches something, the ones afterward would fail with a 409.
On top of the CAPI controllers we also have cilium, ebs-csi-drivers and karpenter. Karpenter is the reason we found this issue: the karpenter taint is deleted first, but then the CAPI controller adds it again every once in a while because it's using a stale version of the object.
My guess is that it's not really common to update status / status.conditions / spec in the same reconciliation loop. And if it happens it's simply fixed on the retry.
One very common use case for updating status and status.conditions together is when metadata.generation changes. Usually .status.observedGeneration and the observedGeneration in metav1.Conditions are then bumped.
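For completeness, a small sketch of that generation-bump pattern on a generic status type (FooStatus and bumpObservedGeneration are illustrative names, not CAPI types):

```go
package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// FooStatus stands in for any status type that follows the usual shape.
type FooStatus struct {
	ObservedGeneration int64              `json:"observedGeneration,omitempty"`
	Conditions         []metav1.Condition `json:"conditions,omitempty"`
}

// bumpObservedGeneration mirrors the pattern described above: when
// metadata.generation changes, status.observedGeneration and the
// observedGeneration stored in the updated condition are bumped together,
// which touches both the status and the status.conditions patch calls.
func bumpObservedGeneration(status *FooStatus, generation int64) {
	status.ObservedGeneration = generation
	meta.SetStatusCondition(&status.Conditions, metav1.Condition{
		Type:               "Ready",
		Status:             metav1.ConditionTrue,
		Reason:             "UpToDate",
		ObservedGeneration: generation,
	})
}
```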