
olm install - operatorhubio-catalog Readiness probe failed

fredcsfaria opened this issue 3 years ago

Bug Report

What did you do?

Installed OLM using the "operator-sdk olm install" command.

What did you expect to see?

The pods in the olm namespace in the "Ready" state.

What did you see instead? Under which circumstances?

The operatorhubio-catalog pod is not ready:

$ kubectl get pods -n olm
(output attached as a screenshot)

$ kubectl describe pod operatorhubio-catalog-sxpwz -n olm
(output attached as a screenshot)

Environment

Minikube 1.25.2

Operator type:

/language go

Kubernetes cluster type:

$ operator-sdk version

operator-sdk version: "v1.22.1", commit: "46ab175459a775d2fb9f0454d0b4a8850dd745ed", kubernetes version: "1.24.1", go version: "go1.18.3", GOOS: "linux", GOARCH: "amd64"

$ go version (if language is Go)

go version go1.18 linux/amd64

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}

Possible Solution

Additional context

fredcsfaria, Jul 12 '22 13:07

@fredcsfaria we would need more context on the cluster being used. Usually this error occurs when there are not enough hosts available to schedule the pods on, or when there is a networking issue on the node where the pod is being scheduled. Are there any logs from the pods that start but never become healthy? Or any errors from the API server that could be used to dig into this?
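For reference, commands along these lines would gather that information (the pod name below is the one from the screenshot in the report and will differ on your cluster):

# Logs from the catalog pod that is failing its probes
$ kubectl -n olm logs operatorhubio-catalog-sxpwz

# Full status and probe configuration of that pod
$ kubectl -n olm describe pod operatorhubio-catalog-sxpwz

# Recent events in the namespace, including scheduling and probe failures
$ kubectl -n olm get events --sort-by=.lastTimestamp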

varshaprasad96, Jul 18 '22 18:07

I'm also seeing this behaviour:

$ kubectl get po

NAME                                READY   STATUS             RESTARTS   AGE
catalog-operator-7f6fdbc478-6b8j2   1/1     Running            0          110m
olm-operator-7f5fbcf9f9-bhljf       1/1     Running            0          110m
operatorhubio-catalog-v5w7b         0/1     CrashLoopBackOff   7          14m
operatorhubio-catalog-x2hbh         0/1     CrashLoopBackOff   19         49m
packageserver-7cf6d4dc5d-4w7xc      1/1     Running            0          109m
packageserver-7cf6d4dc5d-lr8qv      1/1     Running            0          109m
$ kubectl describe po operatorhubio-catalog-v5w7b
Name:         operatorhubio-catalog-v5w7b
Namespace:    olm
Priority:     0
Node:         node1/10.10.168.144
Start Time:   Tue, 09 Aug 2022 00:12:37 +0000
Labels:       olm.catalogSource=operatorhubio-catalog
              olm.pod-spec-hash=6bfddc89d8
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: true
              cni.projectcalico.org/podIP: 10.233.90.178/32
              cni.projectcalico.org/podIPs: 10.233.90.178/32
Status:       Running
IP:           10.233.90.178
IPs:
  IP:  10.233.90.178
Containers:
  registry-server:
    Container ID:   docker://abc02bb38b14eb1fd7b93f14a8209223eeb587d057d75ccefe8c919740309600
    Image:          quay.io/operatorhubio/catalog:latest
    Image ID:       docker-pullable://quay.io/operatorhubio/catalog@sha256:b6492a9fd6fc07ac0975939cd5828151e47b730eda130c75a109bcee3b93a969
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 09 Aug 2022 00:27:19 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 09 Aug 2022 00:21:31 +0000
      Finished:     Tue, 09 Aug 2022 00:22:07 +0000
    Ready:          False
    Restart Count:  8
    Requests:
      cpu:        10m
      memory:     50Mi
    Liveness:     exec [grpc_health_probe -addr=:50051] delay=10s timeout=5s period=10s #success=1 #failure=3
    Readiness:    exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from operatorhubio-catalog-token-xrjtx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  operatorhubio-catalog-token-xrjtx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  operatorhubio-catalog-token-xrjtx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  15m                   default-scheduler  Successfully assigned olm/operatorhubio-catalog-v5w7b to node1
  Normal   Pulled     15m                   kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 405.142183ms
  Normal   Created    14m (x2 over 15m)     kubelet            Created container registry-server
  Normal   Pulled     14m                   kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 361.608421ms
  Normal   Started    14m (x2 over 15m)     kubelet            Started container registry-server
  Warning  Unhealthy  13m (x7 over 14m)     kubelet            Readiness probe failed: timeout: failed to connect service ":50051" within 1s
  Normal   Pulling    13m (x3 over 15m)     kubelet            Pulling image "quay.io/operatorhubio/catalog:latest"
  Warning  Unhealthy  13m (x6 over 14m)     kubelet            Liveness probe failed: timeout: failed to connect service ":50051" within 1s
  Normal   Killing    13m (x2 over 14m)     kubelet            Container registry-server failed liveness probe, will be restarted
  Normal   Pulled     13m                   kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 514.682595ms
  Warning  BackOff    4m56s (x28 over 11m)  kubelet            Back-off restarting failed container
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:40:09Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:32:49Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

I ran the following experiment, manually running the container's entrypoint:

$ docker run --rm -it --entrypoint=sh quay.io/operatorhubio/catalog:latest
sh-4.4$ /bin/opm serve /configs/
WARN[0000] unable to set termination log path            error="open /dev/termination-log: permission denied"

INFO[0048] serving registry                              configs=/configs/ port=50051

It took around 50 seconds for the "serving registry" log message to appear. I believe that is the source of the issue I am experiencing: the container takes longer to initialize than the probes' initialDelaySeconds allows, so the liveness/readiness probes fail and the pod gets stuck in a restart loop.
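A probe configuration along these lines would tolerate that slow start. The field names are standard Kubernetes pod-spec fields, but the values here are only illustrative, not what OLM actually sets, and OLM generates the catalog pod itself, so this is just a sketch of the idea:

# Illustrative probe settings for a slow-starting registry container.
# A startupProbe holds off the liveness/readiness probes until the
# registry answers, so a ~50s startup no longer triggers restarts.
containers:
- name: registry-server
  image: quay.io/operatorhubio/catalog:latest
  ports:
  - containerPort: 50051
  startupProbe:
    exec:
      command: ["grpc_health_probe", "-addr=:50051"]
    periodSeconds: 10
    failureThreshold: 15    # up to 150s before the container is considered failed
  readinessProbe:
    exec:
      command: ["grpc_health_probe", "-addr=:50051"]
    periodSeconds: 10
    timeoutSeconds: 5
  livenessProbe:
    exec:
      command: ["grpc_health_probe", "-addr=:50051"]
    periodSeconds: 10
    timeoutSeconds: 5

Simply raising initialDelaySeconds would also work, but a startupProbe copes better when startup time varies between catalog images.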

murphd40, Aug 09 '22 00:08

Looks like my issue will be fixed in the next release: this PR tackles it 👍 Thanks

murphd40, Aug 10 '22 19:08

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot, Nov 09 '22 01:11

Fixed by https://github.com/operator-framework/operator-lifecycle-manager/pull/2791
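For anyone hitting this on an older install, reinstalling OLM at a release that contains that fix should pick it up; roughly (the version below is a placeholder, check the OLM release notes for one that includes the PR):

# Remove the existing OLM installation
$ operator-sdk olm uninstall

# Reinstall, pinning a release that contains the probe fix (placeholder version)
$ operator-sdk olm install --version 0.26.0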

jmrodri, Nov 22 '22 05:11

I'm still facing this issue on Kubernetes v1.27.3. OLM was installed with:

kubectl create -f \
  https://raw.githubusercontent.com/operator-framework/operator-lifecycle-manager/master/deploy/upstream/quickstart/crds.yaml
kubectl create -f \
  https://raw.githubusercontent.com/operator-framework/operator-lifecycle-manager/master/deploy/upstream/quickstart/olm.yaml

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  75s                default-scheduler  Successfully assigned olm/operatorhubio-catalog-n2cp2 to aks-systempool-27289948-vmss00000c
  Normal   Pulled     73s                kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 716.734417ms (716.844818ms including waiting)
  Normal   Pulling    34s (x2 over 74s)  kubelet            Pulling image "quay.io/operatorhubio/catalog:latest"
  Normal   Killing    34s                kubelet            Container registry-server failed liveness probe, will be restarted
  Warning  Unhealthy  34s                kubelet            Readiness probe errored: rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task d796b71bb52dc8516ad4f1b347934703c842b488d855fde20918bd1021c7172a not found: not found
  Normal   Created    33s (x2 over 73s)  kubelet            Created container registry-server
  Normal   Started    33s (x2 over 73s)  kubelet            Started container registry-server
  Normal   Pulled     33s                kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 605.963458ms (605.976558ms including waiting)
  Warning  Unhealthy  4s (x7 over 64s)   kubelet            Readiness probe failed: timeout: failed to connect service ":50051" within 1s
  Warning  Unhealthy  4s (x5 over 54s)   kubelet            Liveness probe failed: timeout: failed to connect service ":50051" within 1s
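One thing worth checking on a cluster that still shows this is whether the catalog pod actually carries the relaxed probe settings; something along these lines prints them (the label selector comes from the pod description earlier in the thread, and jsonpath is just one way to pull out the probe blocks):

# Print the startup and readiness probe configuration of the running catalog pod
$ kubectl -n olm get pod -l olm.catalogSource=operatorhubio-catalog \
    -o jsonpath='{.items[0].spec.containers[0].startupProbe}{"\n"}{.items[0].spec.containers[0].readinessProbe}{"\n"}'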

sybadm, Nov 17 '23 12:11