cluster-api
Improve patchHelper to allow atomically updating a set of fields
Today patchHelper calculates patches by comparing two objects (let's call them original & desired). The patch that is sent to the apiserver then only contains the delta.
This works fine as long as the original object is up-to-date. The problems start when the original object is outdated, e.g. because we read it from the controller-runtime cache and the object was stale. Please note that when patching status & spec, the patch helper does not use optimistic locking.
An example:

original KCP object (stale)
  status:
    replicas: 3
    readyReplicas: 3

desired KCP object
  status:
    replicas: 4
    readyReplicas: 3
=> patchHelper will send a patch that sets .status.replicas = 4. readyReplicas is not included because patchHelper couldn't see a delta.
But the actual KCP object was:
  status:
    replicas: 3
    readyReplicas: 2
So after patching we will end up with the following KCP object:
  status:
    replicas: 4
    readyReplicas: 2
At no point did KCP calculate this combination of counter fields, but we still end up with this state in the apiserver because the patch calculation was based on a stale object. This can happen with various values for the counter fields, and it gets problematic when other controllers read the KCP status and then make decisions based on it, e.g. they might think KCP has already been entirely upgraded if all replica fields are 3, but this might not actually be the case.
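A minimal sketch of how this plays out in code, using controller-runtime's client.MergeFrom as a stand-in for the two-object diff the helper performs internally (the actual patchHelper implementation differs in the details):

```go
package example

import (
	"context"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchStatusDeltaOnly illustrates the delta-only behaviour: snapshot the
// (possibly stale) object, mutate it, and send only the computed difference.
func patchStatusDeltaOnly(ctx context.Context, c client.Client, kcp *controlplanev1.KubeadmControlPlane) error {
	// Snapshot of the object as read from the cache; if the cache was stale,
	// every field of this snapshot may already be outdated.
	base := kcp.DeepCopy()

	// The reconciler updates the counters it believes have changed.
	kcp.Status.Replicas = 4
	kcp.Status.ReadyReplicas = 3 // identical to the stale snapshot

	// The resulting merge patch only contains {"status":{"replicas":4}}.
	// readyReplicas is equal in base and kcp, so it is not sent, and whatever
	// value the apiserver currently holds (possibly 2) is left untouched.
	return c.Status().Patch(ctx, kcp, client.MergeFrom(base))
}
```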
This might sound like a contrived example, but we actually had this problem and it's super hard to debug. The Cluster topology controller read the KCP status, determined KCP was already entirely upgraded, and then triggered worker upgrades, even though the KCP upgrade was still in progress.
One way to solve this would be to not read KCP from the cache and read it from the apiserver directly, but this is not great for scalability and puts more load on the kube-apiserver.
My proposal would be to add an option to patchHelper to ensure a certain set of fields is always part of the patch, if one of the fields changed. Based on the example above this option would be used to ensure replicas & readyReplicas are always patched together.
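As a rough sketch of what the opt-in could look like at the call site (patch.NewHelper and Helper.Patch exist today; the option name WithAtomicFieldSets is made up for illustration and does not exist in the patch package):

```go
package example

import (
	"context"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/patch"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchKCPCounters sketches the proposed opt-in; WithAtomicFieldSets is hypothetical.
func patchKCPCounters(ctx context.Context, c client.Client, kcp *controlplanev1.KubeadmControlPlane) error {
	helper, err := patch.NewHelper(kcp, c)
	if err != nil {
		return err
	}

	// Reconciliation updates the counters it observed.
	kcp.Status.Replicas = 4
	kcp.Status.ReadyReplicas = 3

	// Hypothetical option: if any field of the set changed compared to the
	// snapshot taken in NewHelper, all fields of the set are forced into the
	// patch, so replicas and readyReplicas can never be split across two
	// different apiserver states.
	return helper.Patch(ctx, kcp,
		patch.WithAtomicFieldSets([][]string{
			{"status.replicas", "status.readyReplicas"},
		}),
	)
}
```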
Somewhat related to ongoing work in https://github.com/kubernetes-sigs/cluster-api/issues/11105
@vincepri @fabriziopandini @chrischdi Opinions?
> One way to solve this would be to not read KCP from the cache and read it from the apiserver directly
Even this change would just minimize the chances of getting an outdated object, but not eliminate the problem, right? I think we would need to go with an optimistic locking approach, or your latter proposal of identifying which sets of changes must be atomic.
> to add an option to patchHelper to ensure a certain set of fields is always part of the patch, if one of the fields changed
Would this be an additional argument or arguments the caller needs to supply when calling patchHelper? Sort of an opt-in approach. Or would patchHelper have knowledge of correlated field requirements for CAPI objects?
> My proposal would be to add an option to patchHelper to ensure a certain set of fields is always part of the patch, if one of the fields changed. Based on the example above this option would be used to ensure replicas & readyReplicas are always patched together.
In some ways though it feels like we might want to enable optimistic locking? That kind of guarantees what you mention above, although it has other side effects.
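For reference, optimistic locking on a patch can already be expressed with controller-runtime's merge-from options; a minimal sketch, assuming the same stale-cache scenario as in the example above (whether and where the patch helper should do this internally is the open question):

```go
package example

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchStatusWithLock is the same delta patch as before, but the snapshot's
// resourceVersion is included, so a stale read surfaces as an error instead
// of silently merging onto newer state.
func patchStatusWithLock(ctx context.Context, c client.Client, kcp *controlplanev1.KubeadmControlPlane) error {
	base := kcp.DeepCopy()
	kcp.Status.Replicas = 4
	kcp.Status.ReadyReplicas = 3

	// MergeFromWithOptimisticLock adds resourceVersion to the patch body, so
	// the apiserver rejects it with a 409 Conflict if the object changed
	// since it was read from the cache.
	p := client.MergeFromWithOptions(base, client.MergeFromWithOptimisticLock{})
	err := c.Status().Patch(ctx, kcp, p)
	if apierrors.IsConflict(err) {
		// This is the side effect mentioned above: the caller has to re-read
		// the object and retry instead of the write succeeding on stale data.
	}
	return err
}
```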
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/assign sbueringer
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Deprioritize it with `/priority important-longterm` or `/priority backlog`
- Close this issue with `/close`
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Deprioritize it with `/priority important-longterm` or `/priority backlog`
- Close this issue with `/close`
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
What would be the downsides of always using optimistic locking from the patch helper? I guess there would be more requests to the API server, because requests that would otherwise succeed could now fail due to the optimistic locking. Any other downsides?
The patchHelper today does up to 3 individual write calls. If all of them are using optimistic locking, which resourceVersion would we use after the first call?
The status condition patch call also has custom merge logic that can succeed and make sense despite being applied based on a stale object.
The situation is pretty complex.
We have deployed a fork with optimistic locking in a CAPA environment that manages 4-5 clusters, with nodes coming and going at any given time, and we don't seem to see many 409s where the CAPI operator has to retry.
If you use optimistic locking on all 3 writes, which resourceVersion are you using for the 2nd and 3rd patch?
How often are you patching more than just one of: status / status.conditions / spec?
And how many controllers do you have reconciling the same object?
We are not changing the resourceVersion between the 3 patches, which means whenever one of them actually patches something, the ones afterward would fail with a 409.
On top of the CAPI controllers we also have cilium, ebs-csi-drivers and karpenter. Karpenter is the reason we found this issue: the karpenter taint is deleted first, but then the CAPI controller adds it again every once in a while because it's using a stale version of the object.
My guess is that it's not really common to update status / status.conditions / spec in the same reconciliation loop. And if it happens it's simply fixed on the retry.
One very common use case for updating status and status.conditions together is when metadata.generation changes. Usually .status.observedGeneration and the observedGeneration in metav1.Conditions are then bumped.
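For completeness, a small sketch of that generation-bump pattern on a generic status type (FooStatus and bumpObservedGeneration are illustrative names, not CAPI types):

```go
package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// FooStatus stands in for any status type that follows the usual shape.
type FooStatus struct {
	ObservedGeneration int64              `json:"observedGeneration,omitempty"`
	Conditions         []metav1.Condition `json:"conditions,omitempty"`
}

// bumpObservedGeneration mirrors the pattern described above: when
// metadata.generation changes, status.observedGeneration and the
// observedGeneration stored in the updated condition are bumped together,
// which touches both the status and the status.conditions patch calls.
func bumpObservedGeneration(status *FooStatus, generation int64) {
	status.ObservedGeneration = generation
	meta.SetStatusCondition(&status.Conditions, metav1.Condition{
		Type:               "Ready",
		Status:             metav1.ConditionTrue,
		Reason:             "UpToDate",
		ObservedGeneration: generation,
	})
}
```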