
Conflict on weird fields

Open revolunet opened this issue 3 years ago • 58 comments

Hi, I don't really understand some conflict errors; maybe someone can help.

Here's an example; these fields appear in the diff:

  • kapp.k14s.io/nonce: sounds legit
  • image: legit, as it's a new version
  • initialDelaySeconds and cpu: I guess they've been "rewritten" by the kube API

These changes look legit but make kapp fail. Any idea how to prevent this?

    Updating resource deployment/app-strapi (apps/v1) namespace: env-1000jours-sre-kube-workflow-4y3w36:
      API server says:
        Operation cannot be fulfilled on deployments.apps "app-strapi": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict):
          Recalculated diff:
 11, 10 -     kapp.k14s.io/nonce: "1660057353002414185"
 12, 10 +     kapp.k14s.io/nonce: "1660062422261409721"
223,222 -   progressDeadlineSeconds: 600
225,223 -   revisionHistoryLimit: 10
230,227 -   strategy:
231,227 -     rollingUpdate:
232,227 -       maxSurge: 25%
233,227 -       maxUnavailable: 25%
234,227 -     type: RollingUpdate
237,229 -       creationTimestamp: null
269,260 -         image: something/strapi:sha-3977fb22378f2debdcacf4eeb6dd6f26dab24377
270,260 -         imagePullPolicy: IfNotPresent
271,260 +         image: something/strapi:sha-4ed2921f2fac053671f80fa02b72d124a23fa8c0
276,266 -             scheme: HTTP
279,268 -           successThreshold: 1
285,273 -           protocol: TCP
291,278 -             scheme: HTTP
292,278 +           initialDelaySeconds: 0
297,284 -             cpu: "1"
298,284 +             cpu: 1
300,287 -             cpu: 500m
301,287 +             cpu: 0.5
307,294 -             scheme: HTTP
309,295 -           successThreshold: 1
310,295 -           timeoutSeconds: 1
311,295 -         terminationMessagePath: /dev/termination-log
312,295 -         terminationMessagePolicy: File
316,298 -       dnsPolicy: ClusterFirst
317,298 -       restartPolicy: Always
318,298 -       schedulerName: default-scheduler
319,298 -       securityContext: {}
320,298 -       terminationGracePeriodSeconds: 30

revolunet avatar Aug 09 '22 16:08 revolunet

nonce is something only another kapp deploy would change.

Could you tell us more about how kapp is being used here?

Is it via kapp-controller? Is there more than one user/client interacting with it using kapp?

To clarify: I mentioned "clients" because kapp might be being used by a CI pipeline, etc.

100mik avatar Aug 09 '22 17:08 100mik

Hi @revolunet!

Is this error consistent or does it only happen sometimes?

initialDelaySeconds and cpu: I guess they've been "rewritten" by the kube API

Are you certain about this? Because the error

the object has been modified; please apply your changes to the latest version and try again (reason: Conflict):

would typically mean that after kapp calculated the diff and before it started applying those changes, something updated the resource in the background and hence we get a conflict. Comparing the recalculated diff and the original diff (can be seen using --diff-changes or -c) might help.
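For illustration, a deploy invocation with the diff printed out might look like this; the app name and manifest path are placeholders for whatever the pipeline already uses:

# placeholders for app name and manifest directory; the diff flags are what matter here
kapp deploy -a my-app -f ./manifests --diff-changes --diff-context=4 --yes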

praveenrewar avatar Aug 09 '22 17:08 praveenrewar

In this case, kapp is used in a GitHub Action and applies manifests produced with this:

https://github.com/SocialGouv/kube-workflow/blob/27ea1ad20b75fe0b4d5f472fa7d650db8b584436/packages/workflow/src/deploy/index.js#L172-L179

I don't think there is a kapp-controller involved, and yes, several kapp deploys could run in parallel, but in different namespaces. This error has been happening quite often these days, maybe 5 to 10% of the time.

So I can add --diff-changes=true --diff-context=4 to the code above and get more diff output?

revolunet avatar Aug 09 '22 19:08 revolunet

So I can add --diff-changes=true --diff-context=4 to the code above and get more diff output?

Yeah, comparing the original diff with the recalculated diff would give us an idea of the fields that are getting updated in the background and we could then try to figure out a way to resolve it (maybe a rebase rule to not update those fields).

praveenrewar avatar Aug 09 '22 19:08 praveenrewar

So here's the full diff for the deployment that fails

  2,  2   metadata:
  3,  3     annotations:
  4,  4       deployment.kubernetes.io/revision: "1"
  5     -     field.cattle.io/publicEndpoints: '[{"addresses":["51.103.10.142"],"port":443,"protocol":"HTTPS","serviceName":"xxx-develop-91uqrt:app-strapi","ingressName":"xxx-develop-91uqrt:app-strapi","hostname":"xxx","path":"/","allNodes":false}]'
  6,  5       kapp.k14s.io/change-group: kube-workflow/xxx-91uqrt
  7,  6       kapp.k14s.io/change-group.app-strapi: kube-workflow/app-strapi.xxx-91uqrt
  8,  7       kapp.k14s.io/change-rule.restore: upsert after upserting kube-workflow/restore.env-xxx
  9,  8       kapp.k14s.io/create-strategy: fallback-on-update
 10,  9       kapp.k14s.io/disable-original: ""
 11     -     kapp.k14s.io/identity: v1;xxx-develop-91uqrt/apps/Deployment/app-strapi;apps/v1
 12     -     kapp.k14s.io/nonce: "1660064438212705134"
     10 +     kapp.k14s.io/nonce: "1660122041210559682"
 13, 11       kapp.k14s.io/update-strategy: fallback-on-replace
 14, 12     creationTimestamp: "2022-08-09T17:03:48Z"
 15, 13     generation: 2
 16, 14     labels:
  ...
221,219     resourceVersion: "247149463"
222,220     uid: cf981ae2-2372-4ab8-961d-ce3155975a86
223,221   spec:
224     -   progressDeadlineSeconds: 600
225,222     replicas: 1
226     -   revisionHistoryLimit: 10
227,223     selector:
228,224       matchLabels:
229,225         component: app-strapi
230,226         kubeworkflow/kapp: xxx
231     -   strategy:
232     -     rollingUpdate:
233     -       maxSurge: 25%
234     -       maxUnavailable: 25%
235     -     type: RollingUpdate
236,227     template:
237,228       metadata:
238     -       creationTimestamp: null
239,229         labels:
240,230           application: xxx
241,231           component: app-strapi
242,232           kapp.k14s.io/association: v1.9b1e71da08ebc442e6cdc77552cb740a
267,257               name: strapi-configmap
268,258           - secretRef:
269,259               name: pg-user-develop
270     -         image: xxx/strapi:sha-3ab94da32cb3b479804c796
271     -         imagePullPolicy: IfNotPresent
    260 +         image: xxx/strapi:sha-6ea5a193875e11b54f4bf333409d1808
272,261           livenessProbe:
273,262             failureThreshold: 15
274,263             httpGet:
275,264               path: /_health
276,265               port: http
277     -             scheme: HTTP
278,266             initialDelaySeconds: 30
279,267             periodSeconds: 5
280     -           successThreshold: 1
281,268             timeoutSeconds: 5
282,269           name: app
283,270           ports:
284,271           - containerPort: 1337
285,272             name: http
286     -           protocol: TCP
287,273           readinessProbe:
288,274             failureThreshold: 15
289,275             httpGet:
290,276               path: /_health
291,277               port: http
292     -             scheme: HTTP
    278 +           initialDelaySeconds: 0
293,279             periodSeconds: 5
294,280             successThreshold: 1
295,281             timeoutSeconds: 1
296,282           resources:
297,283             limits:
298     -             cpu: "1"
    284 +             cpu: 1
299,285               memory: 1Gi
300,286             requests:
301     -             cpu: 500m
    287 +             cpu: 0.5
302,288               memory: 256Mi
303,289           startupProbe:
304,290             failureThreshold: 30
305,291             httpGet:
306,292               path: /_health
307,293               port: http
308     -             scheme: HTTP
309,294             periodSeconds: 5
310     -           successThreshold: 1
311     -           timeoutSeconds: 1
312     -         terminationMessagePath: /dev/termination-log
313     -         terminationMessagePolicy: File
314,295           volumeMounts:
315,296           - mountPath: /app/public/uploads
316,297             name: uploads
317     -       dnsPolicy: ClusterFirst
318     -       restartPolicy: Always
319     -       schedulerName: default-scheduler
320     -       securityContext: {}
321     -       terminationGracePeriodSeconds: 30
322,298         volumes:
323,299         - emptyDir: {}
324,300           name: uploads

revolunet avatar Aug 10 '22 09:08 revolunet

And the recalculated diff is the same as what you have shared in the first comment?

If so, I am seeing these 2 differences:

  5     -     field.cattle.io/publicEndpoints: '[{"addresses":["51.103.10.142"],"port":443,"protocol":"HTTPS","serviceName":"xxx-develop-91uqrt:app-strapi","ingressName":"xxx-develop-91uqrt:app-strapi","hostname":"xxx","path":"/","allNodes":false}]'

...snip...

 11     -     kapp.k14s.io/identity: v1;xxx-develop-91uqrt/apps/Deployment/app-strapi;apps/v1

When kapp initially calculates the diff, it tries to remove these fields, but before it can apply the change, the fields are being removed by something else. Can you think of anything that might be removing these fields? (I am not sure what could be causing kapp to remove the identity annotation in the first place.)

praveenrewar avatar Aug 10 '22 10:08 praveenrewar

No, it's not the same logs, but I can see this in new failures too:

  2,  2   metadata:
  3,  3     annotations:
  4,  4       deployment.kubernetes.io/revision: "1"
  5     -     field.cattle.io/publicEndpoints: '[{"addresses":["51.103.10.142"],"port":443,"protocol":"HTTPS","serviceName":"env-xxx-1-5dc5hx:app-strapi","ingressName":"env-xxx-1-5dc5hx:app-strapi","hostname":"backoffice-env-xxx-1-5dc5hx.devr","path":"/","allNodes":false}]'
  6,  5       kapp.k14s.io/change-group: kube-workflow/env-xxx-1-5dc5hx
  7,  6       kapp.k14s.io/change-group.app-strapi: kube-workflow/app-strapi.env-xxx-1-5dc5hx
  8,  7       kapp.k14s.io/change-rule.restore: upsert after upserting kube-workflow/restore.env-xxx-1-5dc5hx
  9,  8       kapp.k14s.io/create-strategy: fallback-on-update
 10,  9       kapp.k14s.io/disable-original: ""
 11     -     kapp.k14s.io/identity: v1;env-xxx-1-5dc5hx/apps/Deployment/app-strapi;apps/v1
 12     -     kapp.k14s.io/nonce: "1660152849643438728"
     10 +     kapp.k14s.io/nonce: "1660164035630293852"
 13, 11       kapp.k14s.io/update-strategy: fallback-on-replace
 14, 12     creationTimestamp: "2022-08-10T17:36:46Z"

Hi, hmm, maybe the cattle.io annotation comes from our Rancher when the ingress is provisioned.

Can annotations be the cause of a conflict?

revolunet avatar Aug 10 '22 21:08 revolunet

Re: Can annotations be the cause of a conflict?

If an annotation is added after the initial diff, it might lead to this error. We can configure kapp to use rebaseRules and ask kapp to copy that particular annotation from the resource on the cluster to the resource being applied before calculating the diff.

This would involve adding something like this to your manifests:

apiVersion: kapp.k14s.io/v1alpha1
kind: Config
rebaseRules:
- path: [metadata, annotations, field.cattle.io/publicEndpoints]
  type: copy
  sources: [existing]
  resourceMatchers:
  - apiVersionKindMatcher: {apiVersion: apps/v1, kind: Deployment}

This ensures that the diff remains the same when kapp recalculates the diff before applying the changes.
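If the pipeline only passes plain Kubernetes manifests to kapp, the same Config can, to my knowledge, also be embedded in a ConfigMap carrying the kapp.k14s.io/config label; a minimal sketch, with the ConfigMap name and data key chosen arbitrarily:

apiVersion: v1
kind: ConfigMap
metadata:
  # the label is what makes kapp treat this ConfigMap as configuration
  name: kapp-config
  labels:
    kapp.k14s.io/config: ""
data:
  config.yml: |
    apiVersion: kapp.k14s.io/v1alpha1
    kind: Config
    rebaseRules:
    - path: [metadata, annotations, field.cattle.io/publicEndpoints]
      type: copy
      sources: [existing]
      resourceMatchers:
      - apiVersionKindMatcher: {apiVersion: apps/v1, kind: Deployment}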

100mik avatar Aug 10 '22 22:08 100mik

Re:

 10,  9       kapp.k14s.io/disable-original: ""
 11     -     kapp.k14s.io/identity: v1;env-xxx-5dc5hx/apps/Deployment/app-strapi;apps/v1

Was the value of the label being used to identify the app (kubeworkflow/kapp) changed at some point?

I can reproduce something similar by doing something like:

  • Create labelled app
kapp deploy -a label:kubeworkflow/kapp=app-name -f - --yes -c << EOF                                                                                                                                                                                                                                                   
apiVersion: v1
kind: ConfigMap
metadata:
  name: asdf
data:
  foo: bar
EOF

(succeeds!)

  • Change the label being used to identify apps
kapp deploy -a label:kubeworkflow=app-name -f - --yes -c << EOF                                                                                                                                                                                                                                                   
apiVersion: v1
kind: ConfigMap
metadata:
  name: asdf
data:
  foo: bar
EOF
Target cluster 'https://192.168.64.11:8443' (nodes: minikube)

@@ update configmap/asdf (v1) namespace: default @@
  ...
  4,  4   metadata:
  5     -   annotations:
  6     -     kapp.k14s.io/identity: v1;default//ConfigMap/asdf;v1
  7,  5     creationTimestamp: "2022-08-10T22:27:10Z"
  8,  6     labels:
  9,  7       kapp.k14s.io/association: v1.e623db5b5c0d55f2a39d467ca3165a7f
 10     -     kubeworkflow/kapp: app-name
      8 +     kubeworkflow: app-name
 11,  9     managedFields:
 12, 10     - apiVersion: v1

Changes

Namespace  Name  Kind       Age  Op      Op st.  Wait to    Rs  Ri  
default    asdf  ConfigMap  31s  update  -       reconcile  ok  -  

Op:      0 create, 0 delete, 1 update, 0 noop, 0 exists
Wait to: 1 reconcile, 0 delete, 0 noop

3:57:41AM: ---- applying 1 changes [0/1 done] ----
3:57:41AM: update configmap/asdf (v1) namespace: default
3:57:41AM: ---- waiting on 1 changes [0/1 done] ----
3:57:41AM: ok: reconcile configmap/asdf (v1) namespace: default
3:57:41AM: ---- applying complete [1/1 done] ----
3:57:41AM: ---- waiting complete [1/1 done] ----

Succeeded

100mik avatar Aug 10 '22 22:08 100mik

Was the value of the label being used to identify the app (kubeworkflow/kapp) changed at some point?

It's possible that it changed at some point in the past, yes, but now it's stable.

revolunet avatar Aug 10 '22 22:08 revolunet

After trying to overwrite the fields displayed in the diff (initialDelaySeconds and cpu), I end up with:

  Failed to update due to resource conflict (approved diff no longer matches):
    Updating resource deployment/app-strapi (apps/v1) namespace: env-xxx-5dc5hx:
      API server says:
        Operation cannot be fulfilled on deployments.apps "app-strapi": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict):
          Recalculated diff:
 11, 11 -     kapp.k14s.io/nonce: "1660207590418011865"
 12, 11 +     kapp.k14s.io/nonce: "1660209982534815766"
224,224 -   progressDeadlineSeconds: 600
226,225 -   revisionHistoryLimit: 10
231,229 -   strategy:
232,229 -     rollingUpdate:
233,229 -       maxSurge: 25%
234,229 -       maxUnavailable: 25%
235,229 -     type: RollingUpdate
238,231 -       creationTimestamp: null
270,262 -         image: xxx/strapi:sha-1b7c24b0876fdb5c244aa3ada4d96329eb72e1a4
271,262 -         imagePullPolicy: IfNotPresent
272,262 +         image: xxx/strapi:sha-dd16295f5e3d620ffb6874184abbf91f2b304cbf
277,268 -             scheme: HTTP
280,270 -           successThreshold: 1
286,275 -           protocol: TCP
292,280 -             scheme: HTTP
309,296 -             scheme: HTTP
311,297 -           successThreshold: 1
312,297 -           timeoutSeconds: 1
313,297 -         terminationMessagePath: /dev/termination-log
314,297 -         terminationMessagePolicy: File
318,300 -       dnsPolicy: ClusterFirst
319,300 -       restartPolicy: Always
320,300 -       schedulerName: default-scheduler
321,300 -       securityContext: {}
322,300 -       terminationGracePeriodSeconds: 30

revolunet avatar Aug 16 '22 07:08 revolunet

Hey @revolunet! Since you have changed the manifest, could you share the initial diff as well? It will really help in figuring out what is going on. (This is me assuming that the initial diff has changed now, since you last posted it in this thread.)

100mik avatar Aug 16 '22 09:08 100mik

OK, here's the top of the diff for that deployment.

Note: 1b7c24b0876fdb5c244aa3ada4d96329eb72e1a4 is the SHA of the image currently running in the namespace.

@@ update deployment/app-strapi (apps/v1) namespace: env-xxx-5dc5hx @@
  ...
  8,  8       kapp.k14s.io/change-rule.restore: upsert after upserting kube-workflow/restore.env-xxx-5dc5hx
  9,  9       kapp.k14s.io/create-strategy: fallback-on-update
 10, 10       kapp.k14s.io/disable-original: ""
 11     -     kapp.k14s.io/identity: v1;env-xxx-5dc5hx/apps/Deployment/app-strapi;apps/v1
 12     -     kapp.k14s.io/nonce: "1660207590418011865"
     11 +     kapp.k14s.io/nonce: "1660209982534815766"
 13, 12       kapp.k14s.io/update-strategy: fallback-on-replace
 14, 13     creationTimestamp: "2022-08-11T08:49:11Z"
 15, 14     generation: 2
 16, 15     labels:
  ...
222,221     resourceVersion: "247917466"
223,222     uid: 2e7466f0-20aa-452c-9f24-b344a4723716
224,223   spec:
225     -   progressDeadlineSeconds: 600
226,224     replicas: 1
227     -   revisionHistoryLimit: 10
228,225     selector:
229,226       matchLabels:
230,227         component: app-strapi
231,228         kubeworkflow/kapp: xxx
232     -   strategy:
233     -     rollingUpdate:
234     -       maxSurge: 25%
235     -       maxUnavailable: 25%
236     -     type: RollingUpdate
237,229     template:
238,230       metadata:
239     -       creationTimestamp: null
240,231         labels:
241,232           application: xxx
242,233           component: app-strapi
243,234           kapp.k14s.io/association: v1.b90f821a0c6816e919c5ec622aa834cc
  ...
268,259               name: strapi-configmap
269,260           - secretRef:
270,261               name: pg-user-revolunet-patch-1
271     -         image: xxx/strapi:sha-1b7c24b0876fdb5c244aa3ada4d96329eb72e1a4
272     -         imagePullPolicy: IfNotPresent
    262 +         image: xxx/strapi:sha-dd16295f5e3d620ffb6874184abbf91f2b304cbf
273,263           livenessProbe:
274,264             failureThreshold: 15
275,265             httpGet:
276,266               path: /_health
277,267               port: http
278     -             scheme: HTTP
279,268             initialDelaySeconds: 30
280,269             periodSeconds: 5
281     -           successThreshold: 1
282,270             timeoutSeconds: 5
283,271           name: app
284,272           ports:
285,273           - containerPort: 1337
286,274             name: http
287     -           protocol: TCP
288,275           readinessProbe:
289,276             failureThreshold: 15
290,277             httpGet:
291,278               path: /_health
292,279               port: http
293     -             scheme: HTTP
294,280             initialDelaySeconds: 10
295,281             periodSeconds: 5
296,282             successThreshold: 1
297,283             timeoutSeconds: 1
  ...
307,293             httpGet:
308,294               path: /_health
309,295               port: http
310     -             scheme: HTTP
311,296             periodSeconds: 5
312     -           successThreshold: 1
313     -           timeoutSeconds: 1
314     -         terminationMessagePath: /dev/termination-log
315     -         terminationMessagePolicy: File
316,297           volumeMounts:
317,298           - mountPath: /app/public/uploads
318,299             name: uploads
319     -       dnsPolicy: ClusterFirst
320     -       restartPolicy: Always
321     -       schedulerName: default-scheduler
322     -       securityContext: {}
323     -       terminationGracePeriodSeconds: 30
324,300         volumes:
325,301         - emptyDir: {}
326,302           name: uploads

revolunet avatar Aug 16 '22 09:08 revolunet

I see the only conflicting change is the annotation

 11     -     kapp.k14s.io/identity: v1;env-xxx-5dc5hx/apps/Deployment/app-strapi;apps/v1

Could you help me understand a bit better what the resource on the cluster looks like?

Was it previously deployed by kapp? If so, what are the labels and annotations on it? (I am mainly interested in the kapp.k14s.io/... annotations.)

It might be that we are handling some of our own annotations differently while recalculating the diff; I am trying to verify whether that is indeed the case 🤔

100mik avatar Aug 16 '22 09:08 100mik

So on the previous deploy, made with kapp (currently up on the cluster), we have:

kapp.k14s.io/change-group: kube-workflow/env-xxx-5dc5hx
kapp.k14s.io/change-group.app-strapi: kube-workflow/app-strapi.env-xxx-5dc5hx
kapp.k14s.io/change-rule.restore: upsert after upserting kube-workflow/restore.env-xxx-5dc5hx
kapp.k14s.io/create-strategy: fallback-on-update
kapp.k14s.io/disable-original: ""
kapp.k14s.io/identity: v1;env-xxx-5dc5hx/apps/Deployment/app-strapi;apps/v1
kapp.k14s.io/nonce: "1660207590418011865"
kapp.k14s.io/update-strategy: fallback-on-replace

revolunet avatar Aug 16 '22 09:08 revolunet

Does the deployment currently have the label kubeworkflow/kapp? (the one being supplied to kapp as well: -a kubeworkflow/kapp)

100mik avatar Aug 16 '22 09:08 100mik

Sorry, I missed the labels:

labels:
    application: xxx
    component: app-strapi
    kapp.k14s.io/association: v1.b90f821a0c6816e919c5ec622aa834cc
    kubeworkflow/kapp: xxx

revolunet avatar Aug 16 '22 09:08 revolunet

Thanks for the prompt replies!

Gonna take a closer look at this; this is definitely not expected. However, I cannot reproduce the exact issue y'all have been running into :(

The closest I could get was over here in the similar reproduction I posted, where kapp shows that the identity annotation is being removed when it is not.

Marking this as a bug for now, since it looks like the metadata on the deployment is as expected (assuming that env-xxx-5dc5hx is the ns you are working with).

100mik avatar Aug 16 '22 10:08 100mik

Thanks for your help, we're digging into it here too. Yes, the ns is env-xxx-5dc5hx.

revolunet avatar Aug 16 '22 10:08 revolunet

Meanwhile, is there any strategy to force the deployment?

revolunet avatar Aug 16 '22 11:08 revolunet

Heyo! Sorry for the delay, I was verifying a few options.

For the time being you could add the following kapp Config to your manifests:

apiVersion: kapp.k14s.io/v1alpha1
kind: Config

diffAgainstLastAppliedFieldExclusionRules:
- path: [metadata, annotations, "kapp.k14s.io/identity"]
  resourceMatchers:
  - apiVersionKindMatcher: {apiVersion: apps/v1, kind: Deployment}

This would exclude the problematic field from the diff altogether.

If you already have a kapp Config you can just amend it with:

diffAgainstLastAppliedFieldExclusionRules:
- path: [metadata, annotations, "kapp.k14s.io/identity"]
  resourceMatchers:
  - apiVersionKindMatcher: {apiVersion: apps/v1, kind: Deployment}

100mik avatar Aug 17 '22 16:08 100mik

Do let us know if this solution works out for you! Thanks for reporting this, we will be looking into this behaviour.

For reference while prioritising: as part of our diffing process, we remove certain annotations added by kapp before generating a diff. However, the identity annotation is not one of them, which might be one of the reasons for this behaviour.

It is also worth noting that even though this behaviour shows up for labelled apps, it is not observed for recorded apps.

Next steps would be to identify how the pre-diff processing impacts recorded and labelled apps differently.
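For context, the two modes differ only in how the app is identified on the command line; the app name, label value, and manifest path below are placeholders:

# recorded app: kapp stores app metadata in a ConfigMap and labels the resources itself
kapp deploy -a app-strapi -f ./manifests --yes

# labelled app: kapp identifies resources purely by the supplied label (as used in this thread)
kapp deploy -a label:kubeworkflow/kapp=app-strapi -f ./manifests --yes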

100mik avatar Aug 17 '22 21:08 100mik

@revolunet We will drop a ping on this issue when we have a release which resolves this.

100mik avatar Aug 17 '22 21:08 100mik

Thanks for the follow-up; we have quite a specific use case, and maybe this is not kapp-related but due to some other misconfiguration, but I prefer to share it in case it can help anyone in that situation :)

kapp works perfectly in most cases and really helps when deploying a bunch of manifests with dependencies 💯

I tried to add your kapp config as a ConfigMap in our manifests YAML output but it didn't help: https://github.com/SocialGouv/1000jours/pull/1380/commits/a81b816b71dc995690b64012d5bad9be02108983 I'm not sure whether declaring the ConfigMap in the YAML passed to kapp deploy is enough, though.

revolunet avatar Aug 17 '22 22:08 revolunet

Hey!!

I finally resolved this issue, which was caused by several factors (but in the end only one was decisive).

Hypothesis #1: The Bad

First, Rancher was adding metadata.annotations."field.cattle.io/publicEndpoints", and the fix you gave us, using a rebase rule, works for this issue. This is now patched in kube-workflow (legacy) and in kontinuous. @revolunet here are the fixes (you could also put this content in the file created here: https://github.com/SocialGouv/1000jours/commit/a81b816b71dc995690b64012d5bad9be02108983; the format I use is consumed by the CLI, the other one is consumed by the kapp kube controller, which we don't use):

  • https://github.com/SocialGouv/kube-workflow/commit/dc8140b831af1c1825229600872561690915c930
  • https://github.com/SocialGouv/kontinuous/commit/f060c10d8398f83269b67bcae2d0611132d5eb02

Hypothesis #2: The Ugly

kapp + sealed-secrets + reloader: the other thing that was breaking everything was the combination of sealed-secrets + reloader. These tools are compatible with each other, but their behaviour combined with kapp is not. Here is the process that breaks things:

  • kapp creates/updates the SealedSecret resources on the cluster
  • the sealed-secrets operator unseals the secret(s) and creates/updates the Secret on the cluster
  • the reloader operator detects the new Secret and restarts the deployment, making an update, so the deployment is no longer the same version as before

I don't know what the best approach is to solve this, or whether it should be solved at the reloader, sealed-secrets or kapp level; at this time I don't see an option on any of these tools that would resolve the conflict. For now, the only workaround is to not use reloader and to use kapp versioned resources instead, to ensure that the latest version of the unsealed secret is used by the deployment (see the sketch below). (Finally, I'm not sure there is actually an issue here, but I'm sharing it to get your feedback in case you are thinking of something I'm not.)
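A minimal sketch of a versioned resource, assuming the Secret is rendered directly into the manifests (with sealed-secrets the unsealed Secret is created by the operator rather than by kapp, so the exact wiring may differ); the name and key below are placeholders:

apiVersion: v1
kind: Secret
metadata:
  name: pg-user
  annotations:
    # kapp creates pg-user-ver-1, pg-user-ver-2, ... and rewrites references to this
    # Secret in pod templates, so the Deployment rolls out when the value changes
    kapp.k14s.io/versioned: ""
stringData:
  PGPASSWORD: placeholder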

Hypothesis #3: The Good One

Finally, one thing I didn't understand was the link between the command in the job and the deployment. When we had pg_restore in the job it was failing, but when we replaced it with sleep 240 (matching the time pg_restore takes to run) it worked. At first I thought it was related to the resources used, so I reserved large resources for the job. But that was impacting even the Rancher annotations (maybe the network usage had a side effect on the operator, modifying the global behaviour, which seemed very weird to me). Then, after disabling reloader, the deployment didn't seem to reboot, so I thought it was resolved, but a few tries later the deployment started restarting during kapp deploy, before the job had ended (the job is in a change group that is required by a change rule on the deployment). Sorry for the unsustainable suspense (but it took me tens of hours)... It was the pod that was crashing. I didn't know how this service was supposed to work, but there was a poll every few seconds interacting with the DB, and while pg_restore was running, inconsistent data made it crash and restart. This restart, done by kube-controller-manager, was making changes to the manifest. I don't know if this is an issue that can (and should) be treated at the kapp level, but for now we can resolve it on our side.

Sorry for the big mess (and excuse my poor English). Thanks for your help and patience, and big up for developing this great tool that is kapp, we are using it every day!

devthejo avatar Aug 20 '22 12:08 devthejo

the only workaround is to not use reloader and to use kapp versioned resources instead, to ensure that the latest version of the unsealed secret is used by the deployment

This is what I was about to suggest when you mentioned you are using reloader! This would ensure that every part of the update is handled by kapp. It might reduce some overhead as well!

Sorry for the unsustainable suspense (but it took me tens of hours)...

No worries! Happy to hack through this with you

I don't know if this is an issue that can (and should) be treated at the kapp level, but for now we can resolve it on our side.

Trying to process all the information, but two thoughts come to mind.

  1. Are the change rules working as expected?
  2. Are you using versioned resources to update the deployment now?

And big up for developing this great tool that is kapp, we are using it every day !

We are glad it helps!

100mik avatar Aug 22 '22 05:08 100mik

I am marking this issue as "helping with an issue" for the time being, mainly because it seems like there is a lot in your environment that we are not aware of, which makes reproducing the exact issue difficult.

If something that warrants a change on our side surfaces, we will definitely prioritise it!

100mik avatar Aug 22 '22 06:08 100mik

Are the change rules working as expected?

Yes, thanks.

Are you using versioned resources to update the deployment now?

We are working on it, but it will be the case soon.

We have encountered other issues with kube-controller-manager changing annotations and making kapp fail. Every time a Deployment is restarting because of a failure while the kapp deploy command is running, we get a conflict. To resolve this, we now detect these cases and clean up before running kapp deploy, but it would be better if we could distinguish between changes caused by the standard kube-controller-manager automatically restarting the deployment and other changes; I can't say whether that is possible, to my knowledge.
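Purely as an illustration of that kind of pre-flight check (the namespace variable and label are placeholders, not what kube-workflow actually does), something like this could list restart counts before running kapp deploy:

# list restart counts for the target deployment's pods; cleanup or a retry can then
# be triggered before kapp deploy if a pod has been restarting
kubectl get pods -n "$NS" -l component=app-strapi \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].restartCount}{"\n"}{end}'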

To share another kubernetes CI/CD concern with you, a slightly related subject because we use it to detect and clean up problematic deployments: in parallel with kapp we now also use another tool that we forked recently, which is able to detect common errors on deployments, allowing us to fail fast and get better debugging messages; maybe in the future these features could be integrated into kapp. It's written in Go: https://github.com/SocialGouv/rollout-status We have added StatefulSet handling (it previously only handled Deployment errors) and some options. The version of the kubernetes lib is old, but it works pretty well against recent kubernetes servers.

devthejo avatar Sep 03 '22 20:09 devthejo

in parallel with kapp we now also use another tool that we forked recently, which is able to detect common errors on deployments, allowing us to fail fast and get better debugging messages; maybe in the future these features could be integrated into kapp.

Thank you so much for sharing. We will definitely take a look at it and let you know our next steps :)

praveenrewar avatar Sep 06 '22 04:09 praveenrewar

We are having too many conflict issues with our CI.

kube-controller-manager changing annotations and making kapp fail. Every time a Deployment is restarting because of a failure while the kapp deploy command is running, we get a conflict. To resolve this, we now detect these cases and clean up before running kapp deploy, but it would be better if we could distinguish between changes caused by the standard kube-controller-manager automatically restarting the deployment and other changes; I can't say whether that is possible, to my knowledge.

To resolve this, I think we could have a flag to force conflicts, like --force-conflicts here: https://kubernetes.io/docs/reference/using-api/server-side-apply/#conflicts. I can make a PR adding conditions here (etc.): https://github.com/vmware-tanzu/carvel-kapp/blob/9e863ee3668282e236e34142b221d75629e637a4/pkg/kapp/clusterapply/add_or_update_change.go#L151
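For reference, this is the kubectl server-side apply behaviour that page describes (the resource file name is a placeholder):

# server-side apply normally rejects conflicting field ownership;
# --force-conflicts takes ownership of the conflicting fields instead
kubectl apply --server-side --force-conflicts -f deployment.yaml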

What do you think?

devthejo avatar Sep 20 '22 11:09 devthejo