
DaemonSet Unavailable on AWS deployment of OpenShift.

Open owainow opened this issue 2 years ago • 9 comments

I am getting the following issue when trying to install Cilium on an AWS deployment of OpenShift v4.6. I am able to install Cilium through the OperatorHub without a problem; however, when I run the install command on the command line, "cilium install --cluster-name=x", I consistently run into this error. It seems the cilium-agent containers cannot be deployed. I can't find any reference to this issue in the docs.

:hourglass: Waiting for Cilium to be installed...
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         6 errors
 \__/¯¯\__/    Operator:       OK
 /¯¯\__/¯¯\    Hubble:         disabled
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

DaemonSet         cilium             Desired: 5, Unavailable: 5/5
Deployment        cilium-operator    Desired: 1, Ready: 1/1, Available: 1/1
Containers:       cilium             Running: 5
                  cilium-operator    Running: 1
Image versions    cilium             quay.io/cilium/cilium:v1.10.2: 5
                  cilium-operator    quay.io/cilium/operator-generic:v1.10.2: 1
Errors:           cilium             cilium          5 pods of DaemonSet cilium are not ready
                  cilium             cilium-49x5q    unable to retrieve cilium status: unable to upgrade connection: container not found ("cilium-agent")
                  cilium             cilium-fnr5t    unable to retrieve cilium status: unable to upgrade connection: container not found ("cilium-agent")
                  cilium             cilium-mg2kp    unable to retrieve cilium status: unable to upgrade connection: container not found ("cilium-agent")
                  cilium             cilium-s27px    unable to retrieve cilium status: unable to upgrade connection: container not found ("cilium-agent")
                  cilium             cilium-zntcq    unable to retrieve cilium status: unable to upgrade connection: container not found ("cilium-agent")

:leftwards_arrow_with_hook: Rolling back installation...

Error: Unable to install Cilium: timeout while waiting for status to become successful: context deadline exceeded

owainow avatar Aug 09 '21 14:08 owainow

@owainow Could you collect a Cilium sysdump for that cluster? It's hard to help otherwise as cilium status doesn't report in-depth information.
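
For anyone else hitting this: a sysdump can be collected with the cilium CLI itself, roughly as follows (a sketch only; flag availability may differ between cilium-cli versions):

$ # Collects agent/operator logs, config, and pod state into a zip file in the current directory
$ cilium sysdump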

pchaigno avatar Aug 09 '21 15:08 pchaigno

Sure, let me attach it. I'm new to Cilium, so let me know if I've left out any information. cilium-sysdump-20210809-164914.zip

owainow avatar Aug 09 '21 15:08 owainow

There seems to be an issue with pulling the image for one of the operator pods:

"state": {
    "waiting": {
        "message": "Back-off pulling image \"quay.io/cilium/operator-generic:v1.10.2@sha256:a88b04cb5895610620da6e90d362af9e512d2baa51a0a0d77ab34186dfb20c68\"",
        "reason": "ImagePullBackOff"
    }
}
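
As a hedged aside, one way to dig into a pull failure like that is to look at the pod's events and try pulling the image by hand; the pod name, namespace (kube-system is the cilium-cli default, I believe), and tooling below are assumptions:

$ # Show the scheduling/pull events for the failing operator pod (substitute the real pod name)
$ kubectl -n kube-system describe pod cilium-operator-<pod-suffix>
$ # From a node (or any host with quay.io access), confirm the image itself is pullable
$ podman pull quay.io/cilium/operator-generic:v1.10.2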

There are also a couple of errors in the agents:

2021-08-09T14:42:44.922025191Z level=error msg="ListenAndServe failed for service health server, since the user might be running with kube-proxy. Please ensure that '--enable-health-check-nodeport' option is set to false if '--kube-proxy-replacement' is set to 'partial'" error="listen tcp :32313: bind: address already in use" serviceName=router-default serviceNamespace=openshift-ingress subsys=service-healthserver svcHealthCheckNodePort=32313
2021-08-09T14:42:44.922093099Z level=error msg="ListenAndServe failed for service health server" error="listen tcp :32313: bind: address already in use" serviceName=router-default serviceNamespace=openshift-ingress subsys=service-healthserver svcHealthCheckNodePort=32313
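
As a side note, if that port conflict ever needed to be addressed, the log message itself points at the --enable-health-check-nodeport agent option. A sketch of disabling it at install time, assuming this cilium-cli version supports the --config flag and that the agent flag maps to a cilium-config key of the same name:

$ cilium install --cluster-name=x --config enable-health-check-nodeport=false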

However, I don't expect those issues to cause the errors you are seeing. I didn't find anything else in the sysdump. Were the errors still visible in cilium status after you retrieved the Cilium sysdump?

pchaigno avatar Aug 09 '21 21:08 pchaigno

Yes, after collecting the sysdump, running cilium status shows a daemon error again. I have tried again on a different cluster but the problem is consistent. Unsure why, because OCP is able to "Validate" the quay image.

[owain@localhost ~]$ cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         1 errors
 \__/¯¯\__/    Operator:       disabled
 /¯¯\__/¯¯\    Hubble:         disabled
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

Containers:    cilium
               cilium-operator
Errors:        cilium    cilium    daemonsets.apps "cilium" not found


owainow avatar Aug 10 '21 13:08 owainow

Hi, any updates on this?

owainow avatar Aug 18 '21 09:08 owainow

Can anyone help point us in a direction here? The issue seems to still exist.

v1k0d3n avatar Mar 02 '22 13:03 v1k0d3n

@v1k0d3n When I deployed Cilium via OLM on OpenShift (bare metal), I had to manually add the service accounts from Cilium to the privileged SCC, but the namespace was flooded with events related to the policy issues (oc get events). I never got it fully functioning on OpenShift, though, due to some issues with Hubble, which I posted in the cilium hubble repo.
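
For reference, granting the privileged SCC to service accounts is typically done with oc adm policy; the service account and namespace names below are assumptions and depend on how the OLM install laid things out:

$ # Hypothetical names: adjust -z (service account) and -n (namespace) to match your install
$ oc adm policy add-scc-to-user privileged -z cilium -n cilium
$ oc adm policy add-scc-to-user privileged -z cilium-operator -n cilium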

The Cilium CLI insists the DaemonSet does not exist and that the other components are not configured, but it does exist and they are. Perhaps the CLI doesn't work with OLM installations. Under supported environments in the readme it doesn't specifically list OpenShift, so I'm left with the assumption that it is unsupported.

Supported Environments
 minikube
 kind
 EKS
 self-managed
 GKE
 AKS
 k3s
 Rancher
$ cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         1 errors
 \__/¯¯\__/    Operator:       disabled
 /¯¯\__/¯¯\    Hubble:         disabled
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

Containers:      cilium
                 cilium-operator
Cluster Pods:    0/446 managed by Cilium
Errors:          cilium    cilium    daemonsets.apps "cilium" not found

$ oc get ds
NAME     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
cilium   12        12        12      12           12          <none>          6h54m

$ oc get pods
NAME                               READY   STATUS    RESTARTS   AGE
cilium-2vsgk                       1/1     Running   0          3h43m
cilium-75sl2                       1/1     Running   0          3h43m
cilium-7g92r                       1/1     Running   0          3h43m
cilium-b8zc5                       1/1     Running   0          3h43m
cilium-dcvv4                       1/1     Running   0          3h43m
cilium-gs7f6                       1/1     Running   0          3h43m
cilium-kqqdc                       1/1     Running   0          3h43m
cilium-kvq27                       1/1     Running   0          3h43m
cilium-olm-56b8648b4f-v8mcj        1/1     Running   0          3h43m
cilium-operator-55c9dd779d-grxcc   1/1     Running   0          3h43m
cilium-operator-55c9dd779d-kgl68   1/1     Running   0          3h43m
cilium-ptk27                       1/1     Running   0          3h43m
cilium-v2p4q                       1/1     Running   0          3h43m
cilium-wpl26                       1/1     Running   0          3h43m
cilium-znggn                       1/1     Running   0          3h43m
hubble-relay-6584f5545c-99p9n      1/1     Running   0          3h43m
hubble-ui-95d74d44c-cqsqx          3/3     Running   0          3h43m

ctml91 avatar Mar 08 '22 02:03 ctml91

It's probably because cilium isn't installed in the default namespace. It's necessary to provide that namespace to the CLI:

E.g.

cilium status --namespace=cilium
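
If you're not sure which namespace the installation used, the DaemonSet location gives it away; a quick sketch (the grep is just a convenience):

$ kubectl get daemonsets --all-namespaces | grep cilium
$ cilium status --namespace=<namespace from the previous command>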

jotak avatar Sep 30 '22 16:09 jotak

Yes, this is the reason. We can close this issue now.

nickolaev avatar Apr 04 '24 15:04 nickolaev