kube-image-keeper

"Sporadic" controller crash when Pods are in "Terminating" status

Open Nicolasgouze opened this issue 10 months ago • 10 comments

@denniskern: Following our short conversation, can you please provide logs and as much info as possible regarding this issue?

On our side:

  • we cannot reproduce this issue (tested with the 1.6 and 1.7 releases)
  • @paullaffitte checked the code.

We want to be sure we are not missing anything.

Nicolasgouze avatar Apr 08 '24 08:04 Nicolasgouze

May be related to #308

plaffitt avatar Apr 08 '24 11:04 plaffitt

If this issue appeared with version 1.7.0, I think we can confirm that it's the same issue as #308 and thus resolved by #311. @denniskern could you confirm please?

plaffitt avatar Apr 09 '24 12:04 plaffitt

Sorry for the late response, I was on leave.

Yes, exactly this happened. If you have a version available that includes #311, I would be happy to test it.

denniskern avatar Apr 18 '24 08:04 denniskern

Hi @denniskern, PR #311 (which includes a fix) was merged today and will be available in the next release. I will wait for your test before closing this ticket.

Nicolasgouze avatar Apr 19 '24 12:04 Nicolasgouze

@Nicolasgouze just so that we don't talk past each other: I need a release to test it :-)

denniskern avatar Apr 23 '24 10:04 denniskern

@Nicolasgouze @paullaffitte Would you be able to release a beta version so that we can test this more easily? I saw that you have done that before. It would help us get you the required information more quickly.

spr-mweber3 avatar Apr 25 '24 04:04 spr-mweber3

Hello @spr-mweber3, we'll release a beta version next Monday. Stay tuned!

Nicolasgouze avatar Apr 26 '24 13:04 Nicolasgouze

@denniskern @spr-mweber3 v1.8.1-beta.1 is available!

plaffitt avatar Apr 29 '24 09:04 plaffitt

I tested version v1.8.1-beta.1 and the controller crash still occurs. The controller crashes with this message:

2024-05-06T14:31:02.140Z ERROR setup problem running manager {"error": "Pod "kube-prometheus-stack-admission-create-stq7n" is invalid: spec: Forbidden: pod updates may not change fields other than spec.containers[*].image,spec.initContainers[*].image,spec.activeDeadlineSeconds,spec.tolerations (only additions to existing tolerations),spec.terminationGracePeriodSeconds (allow it to be set to 1 if it was previously negative)\n core.PodSpec{\n \t... // 6 identical fields\n \tActiveDeadlineSeconds: nil,\n \tDNSPolicy: "ClusterFirst",\n- \tNodeSelector: nil,\n+ \tNodeSelector: map[string]string{"workergroup": "wg1"},\n \tServiceAccountName: "kube-prometheus-stack-admission",\n \tAutomountServiceAccountToken: nil,\n \t... // 22 identical fields\n }\n"}

The pod kube-prometheus-stack-admission-create-stq7n is in the Terminating state.

denniskern avatar May 06 '24 14:05 denniskern

It looks like we have another problem here. But it is very surprising, because this error appears in the initialization step, as the log message "setup problem running manager" suggests, and the only update that we do on pods during initialization is a no-op (p.Client.Patch(context.Background(), &pod, client.RawPatch(types.JSONPatchType, []byte("[]")))). The goal is to trigger the mutating webhook on all existing pods. And in this mutating webhook we only rewrite images and add annotations, which should not be an issue either.

Which version of Kubernetes are you using, please?

plaffitt avatar May 16 '24 14:05 plaffitt

We are using 1.27

denniskern avatar May 17 '24 06:05 denniskern

Sorry, but I cannot reproduce your issue on a cluster running version 1.27. Is there anything specific in your setup? If you could produce a minimal reproducible example, it would greatly help.

plaffitt avatar May 22 '24 12:05 plaffitt

Hi @denniskern, do you have any further info to provide so that we can try to reproduce and finally fix the issue? Thanks in advance!

Nicolasgouze avatar May 31 '24 12:05 Nicolasgouze

Hi guys @paullaffitte @Nicolasgouze

I figured out that the problem has nothing to do with the state of the pod, but rather with a ClusterPolicy that is handled by Kyverno. In our case we have a ClusterPolicy that adds a NodeSelector to the pod, and because of a timing problem the policy was put in place later than the pod was spawned. This means that when kuik wants to replace the image location, the ClusterPolicy also wants to update the NodeSelector, which is not allowed on an existing pod, and this leads to the error.

But something must have changed as of version 1.7.0, because we don't see this behavior from kuik in version 1.6.0.

Since we fixed the policy it now works fine.

Thanks a lot for your support!

denniskern avatar May 31 '24 14:05 denniskern

Hi @denniskern,

Thanks a lot for the explanation!

We are thinking about working on a "diagnosis tool" that would run at kuik startup in order to check that all cluster prerequisites are met before the kuik services really start. It will not come shortly (we currently have other items under development), and it would not have given you 100% of the root cause in your scenario (because of the timing, and because "we don't know what we don't know").

Nicolasgouze avatar Jun 03 '24 07:06 Nicolasgouze

Got the same error, also related to a Kyverno cpol. I think this should not crash the entire controller, just log an error. Let me know if you would like me to propose a PR to fix this.

devthejo avatar Jul 23 '24 15:07 devthejo