Catalog Source Pod is not recreated when transitioned to the terminated state
Bug Report
What did you do?
- Have a catalog source.
- The catalog source generates a registry pod on node A.
- Node A gets restarted/replaced.
- The pod is not replaced.
What did you expect to see? The pod is recreated.
What did you see instead? Under which circumstances? The dead pod is not replaced. However, deleting the dead pod manually triggers recreation.
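As a workaround until this is fixed, the dead registry pod can be deleted by hand, which triggers recreation. A sketch of that, assuming the catalog source's pod runs in the `olm` namespace (adjust the namespace to your cluster):

```shell
# List catalog registry pods stuck in a terminal phase, then delete them;
# OLM recreates the registry pod once the dead one is gone.
kubectl get pods -n olm --field-selector=status.phase=Failed
kubectl delete pods -n olm --field-selector=status.phase=Failed
```

Note `--field-selector=status.phase=Failed` matches any failed pod in the namespace, so check the `get` output before deleting.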
Environment
- operator-lifecycle-manager version:
OLM version: v0.20.0
git commit: e6428a19b52d2fd7e689577d7be55223b1b2e5f8
- Kubernetes version information:
```
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.6-gke.1500", GitCommit:"5595443086b60d8c5c62342fadc2d4fda9c793e8", GitTreeState:"clean", BuildDate:"2022-02-09T09:25:03Z", GoVersion:"go1.16.12b7", Compiler:"gc", Platform:"linux/amd64"}
```
- Kubernetes cluster kind: GKE
Possible Solution
Check for this pod condition and replace the pod.
Or
One of the comments in https://github.com/operator-framework/operator-lifecycle-manager/issues/2666 suggests managing the CatalogSource pod with a built-in controller again (Deployment/StatefulSet), which should also resolve this.
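The first suggestion above amounts to a small check in the catalog operator's pod reconciliation: treat a pod in a terminal phase as dead and recreate it. A minimal sketch, using a hypothetical simplified `podStatus` struct in place of the real `corev1.PodStatus` (the actual fix would operate on the API types and delete/recreate via the client):

```go
package main

import "fmt"

// podStatus mirrors the PodStatus fields relevant to this bug
// (simplified stand-in for corev1.PodStatus, for illustration only).
type podStatus struct {
	Phase  string // "Pending", "Running", "Succeeded", "Failed"
	Reason string // e.g. "Terminated" after a graceful node shutdown
}

// needsReplacement reports whether a registry pod is in a terminal
// phase and should be deleted so the operator recreates it.
func needsReplacement(s podStatus) bool {
	return s.Phase == "Failed" || s.Phase == "Succeeded"
}

func main() {
	// Matches the pod status attached below: phase Failed, reason Terminated.
	dead := podStatus{Phase: "Failed", Reason: "Terminated"}
	running := podStatus{Phase: "Running"}
	fmt.Println(needsReplacement(dead))    // true
	fmt.Println(needsReplacement(running)) // false
}
```

Deleting such pods on reconcile would cover this case without reintroducing a controller, though a Deployment/StatefulSet would handle it for free.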
Additional context
Pod status:
```yaml
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-03-24T18:48:24Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-03-24T21:02:39Z"
    message: 'containers with unready status: [registry-server]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-03-24T21:02:39Z"
    message: 'containers with unready status: [registry-server]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-03-24T18:48:24Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://d327b53da13df65232bfaa19120022a6a96af630350ac43fb51a17b15f0c55bb
    image: quay.io/operatorhubio/catalog:latest
    imageID: quay.io/operatorhubio/catalog@sha256:009ba4d793616312c7a847dd4a64455971b2d7d68a5d2a16e76d6df3ce03eedc
    lastState: {}
    name: registry-server
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://d327b53da13df65232bfaa19120022a6a96af630350ac43fb51a17b15f0c55bb
        exitCode: 0
        finishedAt: "2022-03-24T21:02:38Z"
        reason: Completed
        startedAt: "2022-03-24T18:48:28Z"
  hostIP: 10.100.4.19
  message: Pod was terminated in response to imminent node shutdown.
  phase: Failed
  podIP: 10.100.18.45
  podIPs:
  - ip: 10.100.18.45
  qosClass: Burstable
  reason: Terminated
  startTime: "2022-03-24T18:48:24Z"
```
Hi @Gentoli,
Thanks for bringing this up -- we know this is affecting users, and it is poor UX not to have the catalog source pod managed by a built-in controller. We will open an RFE on the JIRA board and see if we can get this work prioritized.