operator-sdk
olm install - operatorhubio-catalog Readiness probe failed
Bug Report
What did you do?
Installed OLM using "operator-sdk olm install".
What did you expect to see?
The pods in the olm namespace in the "ready" state.
What did you see instead? Under which circumstances?
The operatorhubio-catalog pod is not ready:
kubectl get pods -n olm
kubectl describe pod operatorhubio-catalog-sxpwz -n olm

Environment
Minikube 1.25.2
Operator type:
/language go
Kubernetes cluster type:
$ operator-sdk version
operator-sdk version: "v1.22.1", commit: "46ab175459a775d2fb9f0454d0b4a8850dd745ed", kubernetes version: "1.24.1", go version: "go1.18.3", GOOS: "linux", GOARCH: "amd64"
$ go version (if language is Go)
go version go1.18 linux/amd64
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
Possible Solution
Additional context
@fredcsfaria we would need more context on the cluster which is being used. Usually this error occurs when there are not enough hosts available to schedule the pods on, or when there is a networking issue with the node where the particular pod is being scheduled. Are there any logs from the pods which are being started but are not healthy? Or any errors from the API server which could be used to dig into this?
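For reference, these are the kinds of commands that would surface that information (assuming the default olm namespace and the pod name from the report above):

$ kubectl -n olm logs operatorhubio-catalog-sxpwz --previous
$ kubectl -n olm get events --sort-by=.lastTimestamp
$ kubectl -n olm describe catalogsource operatorhubio-catalog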
I'm also seeing this behaviour:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
catalog-operator-7f6fdbc478-6b8j2 1/1 Running 0 110m
olm-operator-7f5fbcf9f9-bhljf 1/1 Running 0 110m
operatorhubio-catalog-v5w7b 0/1 CrashLoopBackOff 7 14m
operatorhubio-catalog-x2hbh 0/1 CrashLoopBackOff 19 49m
packageserver-7cf6d4dc5d-4w7xc 1/1 Running 0 109m
packageserver-7cf6d4dc5d-lr8qv 1/1 Running 0 109m
$ kubectl describe po operatorhubio-catalog-v5w7b
Name: operatorhubio-catalog-v5w7b
Namespace: olm
Priority: 0
Node: node1/10.10.168.144
Start Time: Tue, 09 Aug 2022 00:12:37 +0000
Labels: olm.catalogSource=operatorhubio-catalog
olm.pod-spec-hash=6bfddc89d8
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: true
cni.projectcalico.org/podIP: 10.233.90.178/32
cni.projectcalico.org/podIPs: 10.233.90.178/32
Status: Running
IP: 10.233.90.178
IPs:
IP: 10.233.90.178
Containers:
registry-server:
Container ID: docker://abc02bb38b14eb1fd7b93f14a8209223eeb587d057d75ccefe8c919740309600
Image: quay.io/operatorhubio/catalog:latest
Image ID: docker-pullable://quay.io/operatorhubio/catalog@sha256:b6492a9fd6fc07ac0975939cd5828151e47b730eda130c75a109bcee3b93a969
Port: 50051/TCP
Host Port: 0/TCP
State: Running
Started: Tue, 09 Aug 2022 00:27:19 +0000
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Tue, 09 Aug 2022 00:21:31 +0000
Finished: Tue, 09 Aug 2022 00:22:07 +0000
Ready: False
Restart Count: 8
Requests:
cpu: 10m
memory: 50Mi
Liveness: exec [grpc_health_probe -addr=:50051] delay=10s timeout=5s period=10s #success=1 #failure=3
Readiness: exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from operatorhubio-catalog-token-xrjtx (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
operatorhubio-catalog-token-xrjtx:
Type: Secret (a volume populated by a Secret)
SecretName: operatorhubio-catalog-token-xrjtx
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 15m default-scheduler Successfully assigned olm/operatorhubio-catalog-v5w7b to node1
Normal Pulled 15m kubelet Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 405.142183ms
Normal Created 14m (x2 over 15m) kubelet Created container registry-server
Normal Pulled 14m kubelet Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 361.608421ms
Normal Started 14m (x2 over 15m) kubelet Started container registry-server
Warning Unhealthy 13m (x7 over 14m) kubelet Readiness probe failed: timeout: failed to connect service ":50051" within 1s
Normal Pulling 13m (x3 over 15m) kubelet Pulling image "quay.io/operatorhubio/catalog:latest"
Warning Unhealthy 13m (x6 over 14m) kubelet Liveness probe failed: timeout: failed to connect service ":50051" within 1s
Normal Killing 13m (x2 over 14m) kubelet Container registry-server failed liveness probe, will be restarted
Normal Pulled 13m kubelet Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 514.682595ms
Warning BackOff 4m56s (x28 over 11m) kubelet Back-off restarting failed container
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:40:09Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:32:49Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
I ran the following experiment, manually running the entrypoint of the container:
$ docker run --rm -it --entrypoint=sh quay.io/operatorhubio/catalog:latest
sh-4.4$ /bin/opm serve /configs/
WARN[0000] unable to set termination log path error="open /dev/termination-log: permission denied"
INFO[0048] serving registry configs=/configs/ port=50051
It took around 50 seconds for the "serving registry" log message to appear. I believe that is the source of the issue I am experiencing: the container takes too long to initialize, the liveness/readiness probes fail because the initialDelaySeconds for the probes is too short, and the pod gets stuck in a restart loop.
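For illustration, here is roughly what more forgiving probe settings for the registry-server container could look like. This is only a sketch using standard Kubernetes probe fields with made-up values, not what OLM actually ships, and since the catalog pod is created and reconciled by OLM from the CatalogSource, hand-editing the pod is at best a temporary experiment:

containers:
- name: registry-server
  image: quay.io/operatorhubio/catalog:latest
  livenessProbe:
    exec:
      command: ["grpc_health_probe", "-addr=:50051"]
    initialDelaySeconds: 60   # illustrative: give opm ~1 minute to start serving
    timeoutSeconds: 5
    periodSeconds: 10
  readinessProbe:
    exec:
      command: ["grpc_health_probe", "-addr=:50051"]
    initialDelaySeconds: 60   # illustrative: matches the ~50s startup observed above
    timeoutSeconds: 5
    periodSeconds: 10

Alternatively, a startupProbe with a high failureThreshold would give the registry time to come up without delaying restarts once the container is actually running.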
Looks like my issue will be fixed in the next release: This PR tackles the issue 👍 Thanks
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Fixed by https://github.com/operator-framework/operator-lifecycle-manager/pull/2791
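If OLM was installed with operator-sdk, one way to pick up a release containing that fix is to reinstall it at a newer version. A sketch, with an illustrative version string (check the operator-lifecycle-manager releases for the first release that actually includes the PR):

$ operator-sdk olm uninstall
$ operator-sdk olm install --version 0.25.0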
K8S version: v1.27.3
I'm still facing this issue
kubectl create -f https://raw.githubusercontent.com/operator-framework/operator-lifecycle-manager/master/deploy/upstream/quickstart/crds.yaml
kubectl create -f https://raw.githubusercontent.com/operator-framework/operator-lifecycle-manager/master/deploy/upstream/quickstart/olm.yaml
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 75s default-scheduler Successfully assigned olm/operatorhubio-catalog-n2cp2 to aks-systempool-27289948-vmss00000c
Normal Pulled 73s kubelet Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 716.734417ms (716.844818ms including waiting)
Normal Pulling 34s (x2 over 74s) kubelet Pulling image "quay.io/operatorhubio/catalog:latest"
Normal Killing 34s kubelet Container registry-server failed liveness probe, will be restarted
Warning Unhealthy 34s kubelet Readiness probe errored: rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task d796b71bb52dc8516ad4f1b347934703c842b488d855fde20918bd1021c7172a not found: not found
Normal Created 33s (x2 over 73s) kubelet Created container registry-server
Normal Started 33s (x2 over 73s) kubelet Started container registry-server
Normal Pulled 33s kubelet Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 605.963458ms (605.976558ms including waiting)
Warning Unhealthy 4s (x7 over 64s) kubelet Readiness probe failed: timeout: failed to connect service ":50051" within 1s
Warning Unhealthy 4s (x5 over 54s) kubelet Liveness probe failed: timeout: failed to connect service ":50051" within 1s