
Resource conflict regression from v0.60.0

Open universam1 opened this issue 11 months ago • 9 comments

What steps did you take:

We are unable to use any version newer than v0.59.4 with app-deploy; it fails with a resource conflict (approved diff no longer matches).

By process of elimination we have tested the following versions:

  • 0.63.3: FAIL
  • 0.62.1: FAIL
  • 0.61.0: FAIL
  • 0.60.2: FAIL
  • 0.60.0: FAIL
  • 0.59.4: SUCCESS

What happened:

We are deploying the full cluster config from scratch via kapp app-deploy, ~800 resources in total, within a single app. This works amazingly well with Kapp, way better than Helm!

However, since v0.60.0, on the first apply we encounter this error:

  - update daemonset/aws-node (apps/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource daemonset/aws-node (apps/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on daemonsets.apps "aws-node": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations:
  4,  3 -     deprecated.daemonset.template.generation: "1"
  9,  7 -     app.kubernetes.io/managed-by: Helm
 11,  8 -     app.kubernetes.io/version: v1.19.0
 12,  8 -     helm.sh/chart: aws-vpc-cni-1.19.0
 14,  9 +     kapp.k14s.io/app: "1733129085919676830"
 14, 10 +     kapp.k14s.io/association: v1.ca251169611f162ef5186bbf4f512ca0
326,323 -   revisionHistoryLimit: 10
332,328 -       creationTimestamp: null
337,332 +         kapp.k14s.io/app: "1733129085919676830"
337,333 +         kapp.k14s.io/association: v1.ca251169611f162ef5186bbf4f512ca0
356,353 -                 - hybrid
357,353 -                 - auto
362,357 -         - name: ANNOTATE_POD_IP
 363,357 -           value: "false"
384,377 -         - name: CLUSTER_NAME
385,377 -           value: o11n-eks-int-4151
 399,390 -           value: "false"
 400,390 +           value: "true"
 402,393 +         - name: MINIMUM_IP_TARGET
 402,394 +           value: "25"
405,398 -           value: v1.19.0
406,398 -         - name: VPC_ID
407,398 -           value: vpc-23837a4a
408,398 +           value: v1.18.2
 409,400 -           value: "1"
 410,400 +           value: "0"
 410,401 +         - name: WARM_IP_TARGET
 410,402 +           value: "5"
 411,404 -           value: "1"
 412,404 +           value: "0"
422,415 -         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon-k8s-cni:v1.19.0-eksbuild.1
423,415 -         imagePullPolicy: IfNotPresent
424,415 +         image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.18.2
431,423 -           failureThreshold: 3
433,424 -           periodSeconds: 10
434,424 -           successThreshold: 1
440,429 -           protocol: TCP
448,436 -           failureThreshold: 3
450,437 -           periodSeconds: 10
451,437 -           successThreshold: 1
455,440 -             cpu: 25m
456,440 +             cpu: 50m
456,441 +             memory: 80Mi
461,447 -         terminationMessagePath: /dev/termination-log
462,447 -         terminationMessagePolicy: File
489,473 -         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon/aws-network-policy-agent:v1.1.5-eksbuild.1
490,473 -         imagePullPolicy: Always
491,473 +         image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.1.2
494,477 -             cpu: 25m
495,477 +             cpu: 50m
495,478 +             memory: 80Mi
500,484 -         terminationMessagePath: /dev/termination-log
501,484 -         terminationMessagePolicy: File
511,493 -       dnsPolicy: ClusterFirst
519,500 -         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon-k8s-cni-init:v1.19.0-eksbuild.1
520,500 -         imagePullPolicy: Always
521,500 +         image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.18.2
524,504 -             cpu: 25m
525,504 +             cpu: 50m
525,505 +             memory: 80Mi
527,508 -         terminationMessagePath: /dev/termination-log
528,508 -         terminationMessagePolicy: File
533,512 -       restartPolicy: Always
534,512 -       schedulerName: default-scheduler
536,513 -       serviceAccount: aws-node
 544,520 -           type: ""
 548,523 -           type: ""
 552,526 -           type: ""
568,541 -       maxSurge: 0


  - update daemonset/kube-proxy (apps/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource daemonset/kube-proxy (apps/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on daemonsets.apps "kube-proxy": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations:
  4,  3 -     deprecated.daemonset.template.generation: "1"
 10,  8 +     kapp.k14s.io/app: "1733129085919676830"
 10,  9 +     kapp.k14s.io/association: v1.5c5a114581f350e2b57df0ed7799471d
134,134 +         kapp.k14s.io/app: "1733129085919676830"
134,135 +         kapp.k14s.io/association: v1.5c5a114581f350e2b57df0ed7799471d
153,155 -                 - auto
159,160 -         - --hostname-override=$(NODE_NAME)
160,160 -         env:
161,160 -         - name: NODE_NAME
162,160 -           valueFrom:
163,160 -             fieldRef:
164,160 -               apiVersion: v1
165,160 -               fieldPath: spec.nodeName
166,160 -         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/kube-proxy:v1.29.10-minimal-eksbuild.3
167,160 +         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/kube-proxy:v1.29.7-eksbuild.2
171,165 -             cpu: 100m
172,165 +             cpu: 50m
172,166 +             memory: 45Mi

My assumption is that a webhook or a controller might be interfering with Kapp here on certain fields. However, we need to be able to configure the EKS cluster via Kapp even under such a temporary clash.

What did you expect: Kapp to retry

@praveenrewar

Vote on this request

This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.

👍 "I would like to see this addressed as soon as possible" 👎 "There are other more important things to focus on right now"

We are also happy to receive and review Pull Requests if you want to help work on this issue.

universam1 · Dec 02 '24 09:12

Thank you for creating the issue @universam1!

We are deploying the full cluster config from scratch via kapp app-deploy, ~800 resources in total, within a single app. This works amazingly well with Kapp

🙏🏻

However, since v0.60.0, on the first apply we encounter this error:

  • Do these resources already exist on the cluster? (and hence an update)
  • Would you be able to share the complete output for these 2 resources with the --diff-changes flag? Then we can see what the original diff was and compare it with the recalculated diff.
  • I will try to figure out what could have caused this regression in v0.60.0. I took a quick look at the release notes for v0.60.0 but couldn't make out what could have caused it; I will take a closer look in some time.
  • Did the same issue happen even after retrying?

Kapp to retry

One of the principles of kapp is that it guarantees it will only apply the changes that have been approved by the user. If we want to retry on this particular error, it would mean getting a confirmation from the user again, which might not be a great user experience. It would be ideal to retry the kapp deploy from outside, i.e. via some pipeline or a controller like kapp-controller.
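For reference, here is a minimal sketch of such an outside retry driven by kapp-controller, which keeps reconciling (and therefore re-applying) an App until it converges. The app name, namespace, service account, repository URL, and sync period are hypothetical placeholders, not taken from this issue:

apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
metadata:
  name: cluster-core                # hypothetical app name
  namespace: kapp-controller
spec:
  serviceAccountName: cluster-core-sa    # hypothetical SA with the needed permissions
  syncPeriod: 1m                          # reconcile (and hence retry) frequently
  fetch:
  - git:
      url: https://github.com/example/cluster-config   # placeholder repository
      ref: origin/main
      subPath: manifests
  template:
  - ytt: {}
  deploy:
  - kapp:
      rawOptions: ["--diff-changes=true"]

A plain retry loop around kapp deploy --yes in the CI job achieves a similar effect, since each run recalculates and applies a fresh diff.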

praveenrewar · Dec 02 '24 14:12

Thank you for creating the issue @universam1!

Likewise for the quick response!

  • Do these resources already exist on the cluster? (and hence an update)

Maybe! The scenario is a brand-new, vanilla EKS cluster: right after CloudFormation reports success, we call Kapp to deploy the core services. Those core services include updates to existing daemonsets. Apparently (though this is not clear), EKS might have delayed deployments that happen while Kapp is running.

  • Would you be able to share the complete output for these 2 resources with the --diff-changes flag? Then we can see what the original diff was and compare it with the recalculated diff.

Since this is a transient error it is quite hard to reproduce, but I'll try.

  • Did the same issue happen even after retrying?

Probably not. It is hard to test since we are in a CI pipeline here, but it does seem to succeed after retrying. However, we cannot do that in CI due to the one-time, session-zero credentials to EKS, which makes it an "it either succeeded or it didn't" problem.

One of the principles of kapp is that it guarantees it will only apply the changes that have been approved by the user. If we want to retry on this particular error, it would mean getting a confirmation from the user again, which might not be a great user experience. It would be ideal to retry the kapp deploy from outside, i.e. via some pipeline or a controller like kapp-controller.

Please consider that we are not in an interactive session here but in a CI pipeline, running app-deploy. The GitOps setup is mandatory by all means! And we cannot restart the pipeline at this point; we have to succeed or the cluster is broken forever.

I agree that in an interactive session it makes sense to require another user interaction, but here in headless mode in CI, Kapp should have an option to enforce the desired state!

universam1 · Dec 02 '24 15:12

Those core services include updates to existing daemonsets. Apparently (though this is not clear), EKS might have delayed deployments that happen while Kapp is running.

Yeah, that could be the reason.

Since this is a transient error it is quite hard to reproduce, but I'll try.

I see, thanks. If we can check both the original diff and the recalculated diff, it would help us determine the exact fields due to which the diff is changing, and we can probably add rebase rules to ignore those fields.
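For illustration only, a rebase rule of the kind mentioned here is expressed as a kapp Config document deployed alongside the app. The sketch below picks two fields that appear as removals in the recalculated diffs reported later in this thread (revisionHistoryLimit and dnsPolicy); the real paths and matchers would have to be derived from the actual original/recalculated diff pair:

apiVersion: kapp.k14s.io/v1alpha1
kind: Config
rebaseRules:
# Keep cluster-populated defaults instead of removing them,
# so they no longer show up as deletions in the diff.
- paths:
  - [spec, revisionHistoryLimit]
  - [spec, template, spec, dnsPolicy]
  type: copy
  sources: [new, existing]
  resourceMatchers:
  - apiVersionKindMatcher: {apiVersion: apps/v1, kind: Deployment}
  - apiVersionKindMatcher: {apiVersion: apps/v1, kind: DaemonSet}

Such a Config file is passed as an additional -f input to kapp deploy and is merged with kapp's default configuration.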

Probably not. It is hard to test since we are in a CI pipeline here, but it does seem to succeed after retrying.

Curious to know how you were able to pinpoint the exact version of kapp with the issue.

I agree that in an interactive session it makes sense to require another user interaction, but here in headless mode in CI, Kapp should have an option to enforce the desired state!

I agree that such an option would be useful and I have seen a few similar requests in the past. I think it would be good to first determine the root cause and see if a rebase rule would help; otherwise we can think of the best way to retry in such cases.

praveenrewar · Dec 02 '24 21:12

Those core services include updates to existing daemonsets. Apparently (though this is not clear), EKS might have delayed deployments that happen while Kapp is running.

Yeah, that could be the reason.

I have more results from testing, and the problem is not scoped to managed resources. It also happens for resources that are solely owned by Kapp! And it is reproducible. Let me attach examples below.

I see, thanks. If we can check both the original diff and the recalculated diff, it would help us determine the exact fields due to which the diff is changing, and we can probably add rebase rules to ignore those fields.

See the following examples; those resources are Kapp-owned and not touched by any other operator. This is the output of a 3rd retry (I was able to implement CI job retries)!

original diff
@@ update deployment/skipper-ingress (apps/v1) namespace: kube-system @@
  ...
205,205   spec:
206     -   progressDeadlineSeconds: 600
207,206     replicas: 2
208     -   revisionHistoryLimit: 10
209,207     selector:
210,208       matchLabels:
  ...
215,213         maxUnavailable: 0
216     -     type: RollingUpdate
217,214     template:
218,215       metadata:
219     -       creationTimestamp: null
220,216         labels:
221,217           application: skipper-ingress
  ...
289,285           image: registry.opensource.zalan.do/teapot/skipper:v0.21.223
290     -         imagePullPolicy: IfNotPresent
291,286           name: skipper
292,287           ports:
  ...
294,289             name: ingress-port
295     -           protocol: TCP
296,290           - containerPort: 9998
297,291             name: redirect-port
298     -           protocol: TCP
299,292           - containerPort: 9911
300,293             name: metrics-port
301     -           protocol: TCP
302,294           readinessProbe:
303     -           failureThreshold: 3
304,295             httpGet:
305,296               path: /kube-system/healthz
  ...
308,299             initialDelaySeconds: 5
309     -           periodSeconds: 10
310     -           successThreshold: 1
311,300             timeoutSeconds: 1
312,301           resources:
  ...
315,304               memory: 200Mi
316     -         terminationMessagePath: /dev/termination-log
317     -         terminationMessagePolicy: File
318,305           volumeMounts:
319,306           - mountPath: /etc/skipper-cert
  ...
329,316             name: skipper-init
330     -       dnsPolicy: ClusterFirst
331,317         priorityClassName: system-cluster-critical
332     -       restartPolicy: Always
333     -       schedulerName: default-scheduler
334     -       securityContext: {}
335     -       serviceAccount: skipper-ingress
336,318         serviceAccountName: skipper-ingress
337     -       terminationGracePeriodSeconds: 30
338,319         tolerations:
339,320         - effect: NoExecute
  ...
346,327           secret:
347     -           defaultMode: 420
348,328             secretName: skipper-cert
349,329         - name: vault-tls
350,330           secret:
351     -           defaultMode: 420
352,331             secretName: vault-tls
353,332         - name: oidc-secret-file
354,333           secret:
355     -           defaultMode: 420
356,334             secretName: skipper-oidc-secret
357,335         - configMap:

@@ update poddisruptionbudget/skipper-ingress (policy/v1) namespace: kube-system @@
  ...
  2,  2   metadata:
  3     -   annotations: {}
  4,  3     creationTimestamp: "2024-12-02T08:22:15Z"
  5,  4     generation: 1

@@ update prometheus/k8s (monitoring.coreos.com/v1) namespace: monitoring @@
  ...
  2,  2   metadata:
  3     -   annotations: {}
  4,  3     creationTimestamp: "2024-12-02T08:22:20Z"
  5,  4     generation: 1
recalculated diff
Error: 
  - update deployment/skipper-ingress (apps/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource deployment/skipper-ingress (apps/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on deployments.apps "skipper-ingress": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
207,207 -   progressDeadlineSeconds: 600
209,208 -   revisionHistoryLimit: 10
217,215 -     type: RollingUpdate
220,217 -       creationTimestamp: null
291,287 -         imagePullPolicy: IfNotPresent
296,291 -           protocol: TCP
299,293 -           protocol: TCP
302,295 -           protocol: TCP
304,296 -           failureThreshold: 3
310,301 -           periodSeconds: 10
311,301 -           successThreshold: 1
317,306 -         terminationMessagePath: /dev/termination-log
318,306 -         terminationMessagePolicy: File
331,318 -       dnsPolicy: ClusterFirst
333,319 -       restartPolicy: Always
334,319 -       schedulerName: default-scheduler
335,319 -       securityContext: {}
336,319 -       serviceAccount: skipper-ingress
338,320 -       terminationGracePeriodSeconds: 30
348,329 -           defaultMode: 420
352,332 -           defaultMode: 420
356,335 -           defaultMode: 420
  - update poddisruptionbudget/skipper-ingress (policy/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource poddisruptionbudget/skipper-ingress (policy/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on poddisruptionbudgets.policy "skipper-ingress": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations: {}
  - update horizontalpodautoscaler/skipper-ingress (autoscaling/v2) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource horizontalpodautoscaler/skipper-ingress (autoscaling/v2) namespace: kube-system: API server says: Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "skipper-ingress": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
 95, 95 -       selectPolicy: Max
102,101 -       selectPolicy: Max
  - update prometheus/k8s (monitoring.coreos.com/v1) namespace: monitoring: Failed to update due to resource conflict (approved diff no longer matches): Updating resource prometheus/k8s (monitoring.coreos.com/v1) namespace: monitoring: API server says: Operation cannot be fulfilled on prometheuses.monitoring.coreos.com "k8s": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations: {}
189,188 -   evaluationInterval: 30s
205,203 -   portName: web
233,230 -   scrapeInterval: 30s

Curious to know how you were able to pinpoint the exact version of kapp with the issue.

We are running v0.58.0 in production. Upon upgrading to 0.63.3 we saw all integration pipelines failing, consistently. In order to determine the problematic release, I created versions of our CI tooling with all minor versions of Kapp between those two versions and discovered that the latest working version is v0.59.4. Now we are running this version in production.

I agree that such an option would be useful and I have seen a few similar requests in the past. I think it would be good to first determine the root cause and see if a rebase rule would help; otherwise we can think of the best way to retry in such cases.

BTW, I was able to implement a Kapp retry in our CI tool nevertheless. However, even that fails consistently, and even with 3 retries we are unable to converge successfully! It just fails on other resources. So there is a fundamental regression.

universam1 · Dec 03 '24 08:12

Thanks a lot for the details @universam1! Out of the 3 resources that you have shared, for 2 of them the original diff and the recalculated diff are the same, which is definitely weird and probably an issue. I have a hunch about a few changes that could have caused this in v0.60.0. I will try taking a closer look at those changes to see which one could be the root cause. Since I am not able to reproduce the issue on my end, I might need your help in validating the fix.

praveenrewar · Dec 03 '24 08:12

Thank you @praveenrewar for your help! Happy to assist, let me know where I can help! BTW, we are using Kapp as a Go pkg in our CI tool, in case that matters.

universam1 · Dec 03 '24 09:12

@praveenrewar One interesting detail from comparing the logs is that the working versions of Kapp output a lot of Retryable error: messages and eventually succeed, while from v0.60 on not a single retryable log is emitted. Could it be that this internal retry logic is no longer catching these errors?

example retryable logs
8:09:40AM: create issuer/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault
8:09:40AM:  ^ Retryable error: Creating resource issuer/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:40AM: create certificate/vault-secrets-webhook-webhook-tls (cert-manager.io/v1) namespace: vault
8:09:40AM:  ^ Retryable error: Creating resource certificate/vault-secrets-webhook-webhook-tls (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:40AM: create issuer/vault-secrets-webhook-selfsign (cert-manager.io/v1) namespace: vault
8:09:40AM:  ^ Retryable error: Creating resource issuer/vault-secrets-webhook-selfsign (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:40AM: create certificate/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault
8:09:40AM:  ^ Retryable error: Creating resource certificate/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create clusterissuer/selfsigned-issuer (cert-manager.io/v1) cluster
8:09:45AM:  ^ Retryable error: Creating resource clusterissuer/selfsigned-issuer (cert-manager.io/v1) cluster: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create certificate/hubble-server-certs (cert-manager.io/v1) namespace: kube-system
8:09:45AM:  ^ Retryable error: Creating resource certificate/hubble-server-certs (cert-manager.io/v1) namespace: kube-system: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create certificate/hubble-relay-client-certs (cert-manager.io/v1) namespace: kube-system
8:09:45AM:  ^ Retryable error: Creating resource certificate/hubble-relay-client-certs (cert-manager.io/v1) namespace: kube-system: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create certificate/cilium-selfsigned-ca (cert-manager.io/v1) namespace: cert-manager
8:09:45AM:  ^ Retryable error: Creating resource certificate/cilium-selfsigned-ca (cert-manager.io/v1) namespace: cert-manager: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:10:29AM: create certificate/aws-load-balancer-serving-cert (cert-manager.io/v1) namespace: kube-system
8:10:29AM:  ^ Retryable error: Creating resource certificate/aws-load-balancer-serving-cert (cert-manager.io/v1) namespace: kube-system: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:10:34AM: create issuer/self-signer (cert-manager.io/v1) namespace: kube-system

universam1 · Dec 04 '24 08:12

It might be that the conflict is happening before these retryable errors.

praveenrewar · Dec 04 '24 09:12

Hard to reproduce (we haven't been able to reproduce this).

renuy · Jan 10 '25 06:01

We now have a critical issue with this regression, since we cannot upgrade our clusters to EKS 1.32 while we are stuck on Kapp v0.59.4. We have done a lot of testing and can provide plenty of logs; Kapp now consistently fails with Failed to update due to resource conflict errors. Asking for some help here please @praveenrewar @renuy

Example:

[debug] OpsDiff existing compact md5=dda6216a75e1b9ed999afb863c6bb073 len=1244
[debug] OpsDiff new compact md5=618d4285161df9cf1e5289828af67e18 len=1227
[debug] OpsDiff using Unstructured types left=map[string]interface {} right=map[string]interface {}
[debug] OpsDiff existing compact md5=031ac6c986bee4a0b3695be578ea89c3 len=1244
[debug] OpsDiff new compact md5=19adc0ef2a8cc6a0612680f41af8cfe9 len=1227
[debug] OpsDiff using Unstructured types left=map[string]interface {} right=map[string]interface {}
8:19:40PM: update verticalpodautoscaler/iam-chart (autoscaling.k8s.io/v1) namespace: ack-system

Error: 
  - update verticalpodautoscaler/iam-chart (autoscaling.k8s.io/v1) namespace: ack-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource verticalpodautoscaler/iam-chart (autoscaling.k8s.io/v1) namespace: ack-system: API server says: Operation cannot be fulfilled on verticalpodautoscalers.autoscaling.k8s.io "iam-chart": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations: {}

universam1 · Sep 30 '25 07:09