network-observability-operator

controller-manager pod in CrashLoopBackOff

Open cloudcafetech opened this issue 6 months ago • 5 comments

Running on RKE2 + Kubevirt + CDI + Prometheus + Loki (Mono)

  • Deployment
helm repo add netobserv https://netobserv.io/static/helm/ --force-update
helm install netobserv --create-namespace -n netobserv --set standaloneConsole.enable=true netobserv/netobserv-operator
  • Error
# k get po -n netobserv
NAME                                           READY   STATUS             RESTARTS      AGE
netobserv-controller-manager-546bb84fb-ddn2k   0/1     CrashLoopBackOff   5 (62s ago)   4m20s

# k describe po netobserv-controller-manager-546bb84fb-ddn2k -n netobserv
Name:             netobserv-controller-manager-546bb84fb-ddn2k
Namespace:        netobserv
Priority:         0
Service Account:  netobserv-controller-manager
Node:             lenevo-ts-w2/192.168.0.119
Start Time:       Thu, 01 May 2025 02:49:15 +0000
Labels:           app=netobserv-operator
                  control-plane=controller-manager
                  pod-template-hash=546bb84fb
Annotations:      cni.projectcalico.org/containerID: 1326c203864d9fa3db82d55e04c90f82cf71f3414d84944be36425680baafce5
                  cni.projectcalico.org/podIP: 10.244.1.30/32
                  cni.projectcalico.org/podIPs: 10.244.1.30/32
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "k8s-pod-network",
                        "ips": [
                            "10.244.1.30"
                        ],
                        "default": true,
                        "dns": {}
                    }]
Status:           Running
IP:               10.244.1.30
IPs:
  IP:           10.244.1.30
Controlled By:  ReplicaSet/netobserv-controller-manager-546bb84fb
Containers:
  manager:
    Container ID:  containerd://0de483a18749525ca7105ab8b889f4bd2dbb432546236a06445fe90a60f7457a
    Image:         quay.io/netobserv/network-observability-operator:1.8.2-community
    Image ID:      quay.io/netobserv/network-observability-operator@sha256:ed1766e0ca5b94bdd4f645a5f5a38e31b92542b59da226cfeef3d9fc1ceffbac
    Port:          9443/TCP
    Host Port:     0/TCP
    Command:
      /manager
    Args:
      --health-probe-bind-address=:8081
      --metrics-bind-address=:8443
      --leader-elect
      --ebpf-agent-image=$(RELATED_IMAGE_EBPF_AGENT)
      --flowlogs-pipeline-image=$(RELATED_IMAGE_FLOWLOGS_PIPELINE)
      --console-plugin-image=$(RELATED_IMAGE_CONSOLE_PLUGIN)
      --downstream-deployment=$(DOWNSTREAM_DEPLOYMENT)
      --profiling-bind-address=$(PROFILING_BIND_ADDRESS)
      --metrics-cert-file=/etc/tls/private/tls.crt
      --metrics-cert-key-file=/etc/tls/private/tls.key
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 01 May 2025 02:52:33 +0000
      Finished:     Thu, 01 May 2025 02:52:33 +0000
    Ready:          False
    Restart Count:  5
    Limits:
      memory:  400Mi
    Requests:
      cpu:      100m
      memory:   100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RELATED_IMAGE_EBPF_AGENT:         quay.io/netobserv/netobserv-ebpf-agent:v1.8.2-community
      RELATED_IMAGE_FLOWLOGS_PIPELINE:  quay.io/netobserv/flowlogs-pipeline:v1.8.2-community
      RELATED_IMAGE_CONSOLE_PLUGIN:     quay.io/netobserv/network-observability-standalone-frontend:v1.8.2-community
      DOWNSTREAM_DEPLOYMENT:            false
      PROFILING_BIND_ADDRESS:
    Mounts:
      /etc/tls/private from manager-metric-tls (ro)
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2842z (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webhook-server-cert
    Optional:    false
  manager-metric-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  manager-metrics-tls
    Optional:    false
  kube-api-access-2842z:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       4m4s                  default-scheduler  Successfully assigned netobserv/netobserv-controller-manager-546bb84fb-ddn2k to lenevo-ts-w2
  Normal   AddedInterface  4m3s                  multus             Add eth0 [10.244.1.30/32] from k8s-pod-network
  Normal   Pulled          4m1s                  kubelet            Successfully pulled image "quay.io/netobserv/network-observability-operator:1.8.2-community" in 1.101s (1.101s including waiting). Image size: 82112140 bytes.
  Normal   Pulled          4m                    kubelet            Successfully pulled image "quay.io/netobserv/network-observability-operator:1.8.2-community" in 1.238s (1.238s including waiting). Image size: 82112140 bytes.
  Normal   Pulled          3m42s                 kubelet            Successfully pulled image "quay.io/netobserv/network-observability-operator:1.8.2-community" in 1.013s (1.013s including waiting). Image size: 82112140 bytes.
  Normal   Pulling         3m11s (x4 over 4m2s)  kubelet            Pulling image "quay.io/netobserv/network-observability-operator:1.8.2-community"
  Normal   Created         3m10s (x4 over 4m1s)  kubelet            Created container: manager
  Normal   Started         3m10s (x4 over 4m1s)  kubelet            Started container manager
  Normal   Pulled          3m10s                 kubelet            Successfully pulled image "quay.io/netobserv/network-observability-operator:1.8.2-community" in 1.01s (1.01s including waiting). Image size: 82112140 bytes.
  Warning  BackOff         3m3s (x9 over 3m59s)  kubelet            Back-off restarting failed container manager in pod netobserv-controller-manager-546bb84fb-ddn2k_netobserv(d7102a88-1061-41ed-8895-9681609215c7)

# k logs -f netobserv-controller-manager-546bb84fb-ddn2k -n netobserv
2025-05-01T02:52:33.530Z        INFO    setup   Starting netobserv-operator [build version: main-ab3524e, build date: 2025-03-20 11:39]
2025-05-01T02:52:33.561Z        INFO    setup   Initializing metrics certificate watcher using provided certificates    {"metrics-cert-file": "/etc/tls/private/tls.crt", "metrics-cert-key-file": "/etc/tls/private/tls.key"}
2025-05-01T02:52:33.562Z        INFO    controller-runtime.certwatcher  Updated current TLS certificate
2025-05-01T02:52:33.562Z        INFO    Creating manager
2025-05-01T02:52:33.563Z        INFO    Discovering APIs
2025-05-01T02:52:33.599Z        ERROR   setup   unable to setup manager {"error": "can't collect cluster info: unable to retrieve the complete list of server APIs: upload.cdi.kubevirt.io/v1beta1: stale GroupVersion discovery: upload.cdi.kubevirt.io/v1beta1"}
main.main
        /opt/app-root/main.go:190
runtime.main
        /usr/local/go/src/runtime/proc.go:272

cloudcafetech avatar May 01 '25 03:05 cloudcafetech

Hi @cloudcafetech, thanks for opening this issue. At first glance, the root cause doesn't seem to be in netobserv; it looks related to something wrong with a kubevirt server API. Can you run

kubectl get apiservice

and check the AVAILABLE column for any unavailable APIs? I guess a kubevirt one would show False there. If so, that would be the first thing to fix.
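On a cluster with many API services, the unavailable ones are easy to miss in the full listing. A minimal sketch of a filter, assuming the default `kubectl get apiservice` table layout where AVAILABLE is the third whitespace-separated column (the function name `unavailable_apiservices` is made up for illustration):

```shell
# Print only the APIService rows whose AVAILABLE column is not "True".
# Assumes the default table output: NAME, SERVICE, AVAILABLE, AGE.
# A value such as "False (FailedDiscoveryCheck)" splits into two fields,
# so the third field is still "False".
unavailable_apiservices() {
  awk 'NR > 1 && $3 != "True"'
}

# Usage (needs cluster access):
#   kubectl get apiservice | unavailable_apiservices
```

An empty result would mean every aggregated API reports as available.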

jotak avatar May 02 '25 12:05 jotak

Well, there's probably something we can improve on our side too. The error comes from the k8s discovery client that netobserv uses to get some cluster context. It happens when the function ServerGroupsAndResources returns an error. However, that function can return errors AND data together, which means netobserv could still use whatever didn't fail, and that might be enough to run normally. So I guess we can do something here for a future release.

But in the meantime, I'd suggest investigating why the kubevirt API is unavailable.

jotak avatar May 02 '25 12:05 jotak

Can you run

kubectl get apiservice
NAME                                                 SERVICE                           AVAILABLE                      AGE
v1.                                                  Local                             True                           12d
v1.acme.cert-manager.io                              Local                             True                           11d
v1.admissionregistration.k8s.io                      Local                             True                           12d
v1.apiextensions.k8s.io                              Local                             True                           12d
v1.apps                                              Local                             True                           12d
v1.authentication.k8s.io                             Local                             True                           12d
v1.authorization.k8s.io                              Local                             True                           12d
v1.autoscaling                                       Local                             True                           12d
v1.batch                                             Local                             True                           12d
v1.ceph.rook.io                                      Local                             True                           12d
v1.cert-manager.io                                   Local                             True                           11d
v1.certificates.k8s.io                               Local                             True                           12d
v1.console.openshift.io                              Local                             True                           12d
v1.coordination.k8s.io                               Local                             True                           12d
v1.crd.projectcalico.org                             Local                             True                           12d
v1.discovery.k8s.io                                  Local                             True                           12d
v1.events.k8s.io                                     Local                             True                           12d
v1.flowcontrol.apiserver.k8s.io                      Local                             True                           12d
v1.helm.cattle.io                                    Local                             True                           12d
v1.k3s.cattle.io                                     Local                             True                           12d
v1.k8s.cni.cncf.io                                   Local                             True                           12d
v1.kubevirt.io                                       Local                             True                           12d
v1.monitoring.coreos.com                             Local                             True                           12d
v1.networkaddonsoperator.network.kubevirt.io         Local                             True                           5d15h
v1.networking.k8s.io                                 Local                             True                           12d
v1.nmstate.io                                        Local                             True                           12d
v1.node.k8s.io                                       Local                             True                           12d
v1.operators.coreos.com                              Local                             True                           12d
v1.packages.operators.coreos.com                     olm/packageserver-service         True                           12d
v1.policy                                            Local                             True                           12d
v1.rbac.authorization.k8s.io                         Local                             True                           12d
v1.scheduling.k8s.io                                 Local                             True                           12d
v1.snapshot.storage.k8s.io                           Local                             True                           12d
v1.storage.k8s.io                                    Local                             True                           12d
v1.subresources.kubevirt.io                          kubevirt/virt-api                 True                           12d
v1.velero.io                                         Local                             True                           11d
v1alpha1.clone.kubevirt.io                           Local                             True                           12d
v1alpha1.console.openshift.io                        Local                             True                           12d
v1alpha1.export.kubevirt.io                          Local                             True                           12d
v1alpha1.instancetype.kubevirt.io                    Local                             True                           12d
v1alpha1.k8s.cni.cncf.io                             Local                             True                           5d15h
v1alpha1.migrations.kubevirt.io                      Local                             True                           12d
v1alpha1.monitoring.coreos.com                       Local                             True                           12d
v1alpha1.networkaddonsoperator.network.kubevirt.io   Local                             True                           5d15h
v1alpha1.nmstate.io                                  Local                             True                           12d
v1alpha1.objectbucket.io                             Local                             True                           12d
v1alpha1.operators.coreos.com                        Local                             True                           12d
v1alpha1.policy.networking.k8s.io                    Local                             True                           12d
v1alpha1.pool.kubevirt.io                            Local                             True                           12d
v1alpha1.snapshot.kubevirt.io                        Local                             True                           12d
v1alpha1.whereabouts.cni.cncf.io                     Local                             True                           12d
v1alpha2.instancetype.kubevirt.io                    Local                             True                           12d
v1alpha2.operators.coreos.com                        Local                             True                           12d
v1alpha3.kubevirt.io                                 Local                             True                           12d
v1alpha3.subresources.kubevirt.io                    kubevirt/virt-api                 True                           12d
v1beta1.cdi.kubevirt.io                              Local                             True                           12d
v1beta1.clone.kubevirt.io                            Local                             True                           12d
v1beta1.export.kubevirt.io                           Local                             True                           12d
v1beta1.forklift.cdi.kubevirt.io                     Local                             True                           6d9h
v1beta1.forklift.konveyor.io                         Local                             True                           6d9h
v1beta1.instancetype.kubevirt.io                     Local                             True                           12d
v1beta1.metallb.io                                   Local                             True                           12d
v1beta1.metrics.k8s.io                               kube-system/rke2-metrics-server   True                           12d
v1beta1.nmstate.io                                   Local                             True                           12d
v1beta1.snapshot.kubevirt.io                         Local                             True                           12d
v1beta1.upload.cdi.kubevirt.io                       cdi/cdi-api                       False (FailedDiscoveryCheck)   12d
v1beta2.metallb.io                                   Local                             True                           12d
v1beta3.flowcontrol.apiserver.k8s.io                 Local                             True                           12d
v2.autoscaling                                       Local                             True                           12d
v2.operators.coreos.com                              Local                             True                           12d
v2alpha1.velero.io                                   Local                             True                           11d

Anything?

Note: as I mentioned, this is running on RKE2 + Kubevirt + CDI + Prometheus + Loki (Mono)

cloudcafetech avatar May 02 '25 16:05 cloudcafetech

@cloudcafetech it confirms that there's an issue with kubevirt: v1beta1.upload.cdi.kubevirt.io reports False (FailedDiscoveryCheck). The root cause isn't in netobserv. It sounds similar to what is described here: https://github.com/kubevirt/containerized-data-importer/issues/1295 . That would be something to check with kubevirt experts: I can't tell why this API is unavailable. I'm working on a fix on the netobserv side so that it doesn't fail when APIs have such issues; it should be in the next release.
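The listing earlier in the thread shows that the failing APIService is backed by the service `cdi/cdi-api`, so one possible first step is to look at that service and the pods behind it. A rough sketch, not an official CDI procedure; the function name `check_apiservice_backend` and the `KUBECTL` override variable are made up for illustration:

```shell
# kubectl binary to use; override KUBECTL to point at a different client.
KUBECTL="${KUBECTL:-kubectl}"

# Show the service backing an aggregated API, its endpoints, and the pods
# in its namespace. Arguments: <namespace> <service-name>.
check_apiservice_backend() {
  ns="$1"
  svc="$2"
  "$KUBECTL" get service "$svc" -n "$ns"
  "$KUBECTL" get endpoints "$svc" -n "$ns"
  "$KUBECTL" get pods -n "$ns" -o wide
}

# Usage (needs cluster access):
#   check_apiservice_backend cdi cdi-api
```

A missing service, an empty endpoints list, or a pod that is not Ready would each point at the CDI side rather than at netobserv.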

jotak avatar May 13 '25 17:05 jotak

(you might get more troubleshooting info by running kubectl get apiservice v1beta1.upload.cdi.kubevirt.io -oyaml and looking at the status)
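To read only the status conditions rather than the whole YAML, a jsonpath query is one option. A small sketch, assuming the standard APIService `status.conditions` fields (`type`, `status`, `reason`, `message`); the function name `apiservice_conditions` and the `KUBECTL` override variable are made up for illustration:

```shell
# kubectl binary to use; override KUBECTL to point at a different client.
KUBECTL="${KUBECTL:-kubectl}"

# Print one line per status condition of the given APIService,
# in the form "<type>=<status> <reason>: <message>".
apiservice_conditions() {
  "$KUBECTL" get apiservice "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}: {.message}{"\n"}{end}'
}

# Usage (needs cluster access):
#   apiservice_conditions v1beta1.upload.cdi.kubevirt.io
```

The message on the Available condition usually says why the discovery check failed.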

jotak avatar May 13 '25 17:05 jotak