operator-lifecycle-manager

Patching the CatalogSource pod image registry prefix with a mutating webhook causes the pod to terminate and restart

Open shixuguang opened this issue 2 years ago • 2 comments

Bug Report

What did you do? I tried to patch the catalog source pod's image registry prefix so that it points to a different registry than the one defined in the CatalogSource spec image field. I also patched the pod's imagePullSecrets to add a pull secret that has access to that registry prefix. Both patches are applied by a mutating webhook at pod creation time.
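For illustration, a webhook of this kind would typically return a JSONPatch along the following lines. This is a minimal sketch, not the reporter's actual webhook: the function name `build_pod_patch`, the registry hosts `generic.example.com` and `env.example.com`, and the secret name `env-pull-secret` are all hypothetical.

```python
import json


def build_pod_patch(pod, old_prefix, new_prefix, pull_secret):
    """Return a JSONPatch (RFC 6902) list for the given Pod dict."""
    patch = []
    # Rewrite the registry prefix of every container image that uses it.
    for i, container in enumerate(pod["spec"].get("containers", [])):
        image = container["image"]
        if image.startswith(old_prefix + "/"):
            patch.append({
                "op": "replace",
                "path": f"/spec/containers/{i}/image",
                "value": new_prefix + image[len(old_prefix):],
            })
    # Add the pull secret, creating the list if the pod has none yet.
    secrets = pod["spec"].get("imagePullSecrets")
    if secrets is None:
        patch.append({"op": "add", "path": "/spec/imagePullSecrets",
                      "value": [{"name": pull_secret}]})
    elif {"name": pull_secret} not in secrets:
        patch.append({"op": "add", "path": "/spec/imagePullSecrets/-",
                      "value": {"name": pull_secret}})
    return patch


# Hypothetical pod as the API server would hand it to the webhook.
pod = {
    "spec": {
        "containers": [
            {"name": "registry-server",
             "image": "generic.example.com/nginx-operator-catalog:latest"},
        ],
    },
}
patch = build_pod_patch(pod, "generic.example.com", "env.example.com",
                        "env-pull-secret")
print(json.dumps(patch, indent=2))
```

In a real webhook this list would be base64-encoded into the AdmissionReview response; that wiring is omitted here.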

What did you expect to see? The catalog source pod pulls its image from the patched registry prefix and starts.

What did you see instead? Under which circumstances? Although the pod is patched with both the registry prefix and imagePullSecrets, its container is killed soon after it starts, so multiple copies of the pod cycle endlessly between the Terminating and ContainerCreating states:

  Normal  Created         <invalid>  kubelet            Created container registry-server
  Normal  Started         <invalid>  kubelet            Started container registry-server
  Normal  Killing         <invalid>  kubelet            Stopping container registry-server

NAME                             READY   STATUS              RESTARTS   AGE
nginx-operator-catalog-hbwhn   0/1     Terminating         0          1s
nginx-operator-catalog-hqckk   0/1     Terminating         0          2s
nginx-operator-catalog-mwfmg   0/1     Terminating         0          4s
nginx-operator-catalog-psn24   0/1     ContainerCreating   0          0s

Please note that if the registry prefix is set in the CatalogSource spec image and the webhook patches only imagePullSecrets, everything works. Our use case is to put a generic registry prefix in the CatalogSource spec and override it with an environment-specific one at runtime, and a global pull prefix might not be available.

Environment

  • operator-lifecycle-manager version:
OLM version: 0.19.0
git commit: 3a667ecc956dcaacf47eb748ec2f2882633fc150
  • Kubernetes version information: v1.24.12+ceaf338
  • Kubernetes cluster kind: openshift

Possible Solution

Additional context

catalog operator pod log:

time="2023-06-28T04:42:34Z" level=debug msg="syncing catsrc" id=HojpG source=nginx-operator-catalog
time="2023-06-28T04:42:34Z" level=debug msg="check registry server healthy: false" id=HojpG source=nginx-operator-catalog
time="2023-06-28T04:42:34Z" level=debug msg="ensuring registry server" id=HojpG source=nginx-operator-catalog
time="2023-06-28T04:42:35Z" level=debug msg="handling object deletion" name=nginx-operator-catalog-nqp44 namespace=default
time="2023-06-28T04:42:35Z" level=debug msg="handling object deletion" name=nginx-operator-catalog-wlhjw namespace=default
time="2023-06-28T04:42:36Z" level=debug msg="ensured registry server" id=HojpG source=nginx-operator-catalog
time="2023-06-28T04:42:36Z" level=debug msg="syncing catsrc" id=goLzt source=nginx-operator-catalog
time="2023-06-28T04:42:36Z" level=debug msg="check registry server healthy: false" id=goLzt source=nginx-operator-catalog
time="2023-06-28T04:42:36Z" level=debug msg="ensuring registry server" id=goLzt source=nginx-operator-catalog
time="2023-06-28T04:42:36Z" level=debug msg="syncing catalog source for annotation templates" catSrcName=nginx-operator-catalog catSrcNamespace=default id=zBVjZ
time="2023-06-28T04:42:36Z" level=debug msg="this catalog source is not participating in template replacement" catSrcName=nginx-operator-catalog catSrcNamespace=default id=zBVjZ
time="2023-06-28T04:42:36Z" level=debug msg="RemoveStatusConditions - request to remove status conditions did not result in any changes, so updates were not made" catSrcName=nginx-operator-catalog catSrcNamespace=default id=zBVjZ
time="2023-06-28T04:42:37Z" level=debug msg="handling object deletion" name=nginx-operator-catalog-km924 namespace=default
time="2023-06-28T04:42:37Z" level=debug msg="handling object deletion" name=nginx-operator-catalog-5bwb2 namespace=default
time="2023-06-28T04:42:37Z" level=debug msg="ensured registry server" id=goLzt source=nginx-operator-catalog
time="2023-06-28T04:42:37Z" level=error msg="UpdateStatus - error while setting CatalogSource status" error="Operation cannot be fulfilled on catalogsources.operators.coreos.com \"nginx-operator-catalog\": the object has been modified; please apply your changes to the latest version and try again" id=goLzt source=nginx-operator-catalog
E0628 04:42:37.232862       1 queueinformer_operator.go:290] sync {"update" "default/nginx-operator-catalog"} failed: Operation cannot be fulfilled on catalogsources.operators.coreos.com "nginx-operator-catalog": the object has been modified; please apply your changes to the latest version and try again

shixuguang • Jun 28 '23 04:06

@shixuguang this is expected. The pods are owned by the CatalogSource CRs, so if you edit a pod, the catalog-operator reverts the change. Essentially, the catalog-operator and the webhook are fighting each other: the webhook edits the pods, the catalog-operator reverts them, and the cycle repeats indefinitely.
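The fight described here can be illustrated with a toy model. This is a sketch under stated assumptions, not OLM's actual code: the controller is assumed to compare only the pod image against the CatalogSource image (chosen to match the report that patching only imagePullSecrets works) and to replace any pod that differs. All function names and registry hosts are hypothetical.

```python
OLD_PREFIX = "generic.example.com"  # hypothetical prefix in the CatalogSource
NEW_PREFIX = "env.example.com"      # hypothetical prefix the webhook injects


def desired_pod(catalog_source):
    """Pod spec the controller derives from the CatalogSource (simplified)."""
    return {"image": catalog_source["image"]}


def webhook_mutate(pod):
    """Admission-time mutation: swap the registry prefix on a new pod."""
    mutated = dict(pod)
    if mutated["image"].startswith(OLD_PREFIX + "/"):
        mutated["image"] = NEW_PREFIX + mutated["image"][len(OLD_PREFIX):]
    return mutated


def reconcile(catalog_source, live_pod):
    """Return (pod, recreated): replace the pod if it differs from desired."""
    want = desired_pod(catalog_source)
    if live_pod == want:
        return live_pod, False
    # Mismatch: the old pod is deleted and a new one is created, and the
    # new pod passes through the webhook again.
    return webhook_mutate(want), True


def count_recreations(catalog_source, rounds=5):
    """Count how many of `rounds` sync passes end up replacing the pod."""
    pod = webhook_mutate(desired_pod(catalog_source))
    recreated = 0
    for _ in range(rounds):
        pod, changed = reconcile(catalog_source, pod)
        recreated += changed
    return recreated


generic = {"image": "generic.example.com/nginx-operator-catalog:latest"}
specific = {"image": "env.example.com/nginx-operator-catalog:latest"}
print(count_recreations(generic))   # 5: the pod is replaced on every sync
print(count_recreations(specific))  # 0: the webhook is a no-op, pod is stable
```

In this model the loop never converges while the webhook and the CatalogSource disagree about the image, and it is stable as soon as the CatalogSource itself carries the environment-specific prefix.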

It's the CatalogSource CR that needs to be patched.
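A minimal sketch of what such a patch body could look like, assuming the environment-specific image and pull secret are set through the CatalogSource `spec.image` and `spec.secrets` fields. The helper name, image reference, and secret name below are hypothetical.

```python
import json


def catalogsource_merge_patch(image, pull_secrets):
    """Build a JSON merge patch body that sets spec.image and spec.secrets."""
    return {"spec": {"image": image, "secrets": list(pull_secrets)}}


body = catalogsource_merge_patch(
    "env.example.com/nginx-operator-catalog:latest",  # hypothetical image
    ["env-pull-secret"],                              # hypothetical secret name
)
print(json.dumps(body))
```

The printed JSON could be passed to `kubectl patch catalogsource nginx-operator-catalog -n default --type merge -p '<body>'`. Note that a merge patch replaces the whole `secrets` list rather than appending to it.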

anik120 • Jun 29 '23 17:06

@anik120 Why is CatalogSource managing pods directly instead of managing a deployment?

Many mutating webhooks modify pods. It's not reasonable to expect each one to understand CatalogSources and patch those instead.

Additionally, if server-side apply was used by the CatalogSource controller, I don't believe this issue would have happened.

This just resulted in production issues for us because of the pressure it put on kube-proxy. I do think this behaviour is undesirable and should be fixed, rather than requiring the rest of the Kubernetes ecosystem to work around OLM's decision to manage pods directly (something that is generally discouraged).

cm3lindsay • Oct 18 '23 02:10