network-observability-operator
controller-manager pod CrashLoopBackOff
Running on RKE2 + Kubevirt + CDI + Prometheus + Loki (Mono)
- Deployment
helm repo add netobserv https://netobserv.io/static/helm/ --force-update
helm install netobserv --create-namespace -n netobserv --set standaloneConsole.enable=true netobserv/netobserv-operator
- Error
# k get po -n netobserv
NAME READY STATUS RESTARTS AGE
netobserv-controller-manager-546bb84fb-ddn2k 0/1 CrashLoopBackOff 5 (62s ago) 4m20s
# k describe po netobserv-controller-manager-546bb84fb-ddn2k -n netobserv
Name: netobserv-controller-manager-546bb84fb-ddn2k
Namespace: netobserv
Priority: 0
Service Account: netobserv-controller-manager
Node: lenevo-ts-w2/192.168.0.119
Start Time: Thu, 01 May 2025 02:49:15 +0000
Labels: app=netobserv-operator
control-plane=controller-manager
pod-template-hash=546bb84fb
Annotations: cni.projectcalico.org/containerID: 1326c203864d9fa3db82d55e04c90f82cf71f3414d84944be36425680baafce5
cni.projectcalico.org/podIP: 10.244.1.30/32
cni.projectcalico.org/podIPs: 10.244.1.30/32
k8s.v1.cni.cncf.io/network-status:
[{
"name": "k8s-pod-network",
"ips": [
"10.244.1.30"
],
"default": true,
"dns": {}
}]
Status: Running
IP: 10.244.1.30
IPs:
IP: 10.244.1.30
Controlled By: ReplicaSet/netobserv-controller-manager-546bb84fb
Containers:
manager:
Container ID: containerd://0de483a18749525ca7105ab8b889f4bd2dbb432546236a06445fe90a60f7457a
Image: quay.io/netobserv/network-observability-operator:1.8.2-community
Image ID: quay.io/netobserv/network-observability-operator@sha256:ed1766e0ca5b94bdd4f645a5f5a38e31b92542b59da226cfeef3d9fc1ceffbac
Port: 9443/TCP
Host Port: 0/TCP
Command:
/manager
Args:
--health-probe-bind-address=:8081
--metrics-bind-address=:8443
--leader-elect
--ebpf-agent-image=$(RELATED_IMAGE_EBPF_AGENT)
--flowlogs-pipeline-image=$(RELATED_IMAGE_FLOWLOGS_PIPELINE)
--console-plugin-image=$(RELATED_IMAGE_CONSOLE_PLUGIN)
--downstream-deployment=$(DOWNSTREAM_DEPLOYMENT)
--profiling-bind-address=$(PROFILING_BIND_ADDRESS)
--metrics-cert-file=/etc/tls/private/tls.crt
--metrics-cert-key-file=/etc/tls/private/tls.key
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 01 May 2025 02:52:33 +0000
Finished: Thu, 01 May 2025 02:52:33 +0000
Ready: False
Restart Count: 5
Limits:
memory: 400Mi
Requests:
cpu: 100m
memory: 100Mi
Liveness: http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
Readiness: http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
Environment:
RELATED_IMAGE_EBPF_AGENT: quay.io/netobserv/netobserv-ebpf-agent:v1.8.2-community
RELATED_IMAGE_FLOWLOGS_PIPELINE: quay.io/netobserv/flowlogs-pipeline:v1.8.2-community
RELATED_IMAGE_CONSOLE_PLUGIN: quay.io/netobserv/network-observability-standalone-frontend:v1.8.2-community
DOWNSTREAM_DEPLOYMENT: false
PROFILING_BIND_ADDRESS:
Mounts:
/etc/tls/private from manager-metric-tls (ro)
/tmp/k8s-webhook-server/serving-certs from cert (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2842z (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
cert:
Type: Secret (a volume populated by a Secret)
SecretName: webhook-server-cert
Optional: false
manager-metric-tls:
Type: Secret (a volume populated by a Secret)
SecretName: manager-metrics-tls
Optional: false
kube-api-access-2842z:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m4s default-scheduler Successfully assigned netobserv/netobserv-controller-manager-546bb84fb-ddn2k to lenevo-ts-w2
Normal AddedInterface 4m3s multus Add eth0 [10.244.1.30/32] from k8s-pod-network
Normal Pulled 4m1s kubelet Successfully pulled image "quay.io/netobserv/network-observability-operator:1.8.2-community" in 1.101s (1.101s including waiting). Image size: 82112140 bytes.
Normal Pulled 4m kubelet Successfully pulled image "quay.io/netobserv/network-observability-operator:1.8.2-community" in 1.238s (1.238s including waiting). Image size: 82112140 bytes.
Normal Pulled 3m42s kubelet Successfully pulled image "quay.io/netobserv/network-observability-operator:1.8.2-community" in 1.013s (1.013s including waiting). Image size: 82112140 bytes.
Normal Pulling 3m11s (x4 over 4m2s) kubelet Pulling image "quay.io/netobserv/network-observability-operator:1.8.2-community"
Normal Created 3m10s (x4 over 4m1s) kubelet Created container: manager
Normal Started 3m10s (x4 over 4m1s) kubelet Started container manager
Normal Pulled 3m10s kubelet Successfully pulled image "quay.io/netobserv/network-observability-operator:1.8.2-community" in 1.01s (1.01s including waiting). Image size: 82112140 bytes.
Warning BackOff 3m3s (x9 over 3m59s) kubelet Back-off restarting failed container manager in pod netobserv-controller-manager-546bb84fb-ddn2k_netobserv(d7102a88-1061-41ed-8895-9681609215c7)
# k logs -f netobserv-controller-manager-546bb84fb-ddn2k -n netobserv
2025-05-01T02:52:33.530Z INFO setup Starting netobserv-operator [build version: main-ab3524e, build date: 2025-03-20 11:39]
2025-05-01T02:52:33.561Z INFO setup Initializing metrics certificate watcher using provided certificates {"metrics-cert-file": "/etc/tls/private/tls.crt", "metrics-cert-key-file": "/etc/tls/private/tls.key"}
2025-05-01T02:52:33.562Z INFO controller-runtime.certwatcher Updated current TLS certificate
2025-05-01T02:52:33.562Z INFO Creating manager
2025-05-01T02:52:33.563Z INFO Discovering APIs
2025-05-01T02:52:33.599Z ERROR setup unable to setup manager {"error": "can't collect cluster info: unable to retrieve the complete list of server APIs: upload.cdi.kubevirt.io/v1beta1: stale GroupVersion discovery: upload.cdi.kubevirt.io/v1beta1"}
main.main
/opt/app-root/main.go:190
runtime.main
/usr/local/go/src/runtime/proc.go:272
Hi @cloudcafetech, thanks for opening this issue. At first glance, it seems the root cause isn't in netobserv but rather something wrong with the kubevirt server API. Can you run
kubectl get apiservice
and check whether any APIs are unavailable (in the AVAILABLE column)? I guess kubevirt would show False there. So that would be something to fix in the first place.
Well, there's probably something we can improve on our side too. The error comes from the k8s discovery client that netobserv uses to gather some cluster context. It happens when the function ServerGroupsAndResources returns an error; however, that function can return errors AND data together, which means netobserv could still use whatever didn't fail, and that might be sufficient to run normally (see the sketch below). So I guess we can do something here for a future release.
But in the meantime, I'd suggest investigating why the kubevirt API is unavailable.
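For illustration only (this is not the actual netobserv code), a minimal sketch of that idea: client-go's ServerGroupsAndResources can return an *ErrGroupDiscoveryFailed together with the groups that did resolve, and discovery.IsGroupDiscoveryFailedError lets the caller treat a stale APIService such as upload.cdi.kubevirt.io/v1beta1 as a non-fatal, partial failure.

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the code runs in-cluster, like the operator does.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	groups, resources, err := dc.ServerGroupsAndResources()
	if err != nil {
		if discovery.IsGroupDiscoveryFailedError(err) {
			// Some API groups failed discovery (e.g. a stale APIService),
			// but the rest of the data is still usable: log and continue
			// instead of failing the whole startup.
			log.Printf("partial discovery failure, continuing: %v", err)
		} else {
			log.Fatal(err) // a real, unrecoverable error
		}
	}
	fmt.Printf("discovered %d API groups and %d resource lists\n", len(groups), len(resources))
}
```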
Can you run
kubectl get apiservice
NAME SERVICE AVAILABLE AGE
v1. Local True 12d
v1.acme.cert-manager.io Local True 11d
v1.admissionregistration.k8s.io Local True 12d
v1.apiextensions.k8s.io Local True 12d
v1.apps Local True 12d
v1.authentication.k8s.io Local True 12d
v1.authorization.k8s.io Local True 12d
v1.autoscaling Local True 12d
v1.batch Local True 12d
v1.ceph.rook.io Local True 12d
v1.cert-manager.io Local True 11d
v1.certificates.k8s.io Local True 12d
v1.console.openshift.io Local True 12d
v1.coordination.k8s.io Local True 12d
v1.crd.projectcalico.org Local True 12d
v1.discovery.k8s.io Local True 12d
v1.events.k8s.io Local True 12d
v1.flowcontrol.apiserver.k8s.io Local True 12d
v1.helm.cattle.io Local True 12d
v1.k3s.cattle.io Local True 12d
v1.k8s.cni.cncf.io Local True 12d
v1.kubevirt.io Local True 12d
v1.monitoring.coreos.com Local True 12d
v1.networkaddonsoperator.network.kubevirt.io Local True 5d15h
v1.networking.k8s.io Local True 12d
v1.nmstate.io Local True 12d
v1.node.k8s.io Local True 12d
v1.operators.coreos.com Local True 12d
v1.packages.operators.coreos.com olm/packageserver-service True 12d
v1.policy Local True 12d
v1.rbac.authorization.k8s.io Local True 12d
v1.scheduling.k8s.io Local True 12d
v1.snapshot.storage.k8s.io Local True 12d
v1.storage.k8s.io Local True 12d
v1.subresources.kubevirt.io kubevirt/virt-api True 12d
v1.velero.io Local True 11d
v1alpha1.clone.kubevirt.io Local True 12d
v1alpha1.console.openshift.io Local True 12d
v1alpha1.export.kubevirt.io Local True 12d
v1alpha1.instancetype.kubevirt.io Local True 12d
v1alpha1.k8s.cni.cncf.io Local True 5d15h
v1alpha1.migrations.kubevirt.io Local True 12d
v1alpha1.monitoring.coreos.com Local True 12d
v1alpha1.networkaddonsoperator.network.kubevirt.io Local True 5d15h
v1alpha1.nmstate.io Local True 12d
v1alpha1.objectbucket.io Local True 12d
v1alpha1.operators.coreos.com Local True 12d
v1alpha1.policy.networking.k8s.io Local True 12d
v1alpha1.pool.kubevirt.io Local True 12d
v1alpha1.snapshot.kubevirt.io Local True 12d
v1alpha1.whereabouts.cni.cncf.io Local True 12d
v1alpha2.instancetype.kubevirt.io Local True 12d
v1alpha2.operators.coreos.com Local True 12d
v1alpha3.kubevirt.io Local True 12d
v1alpha3.subresources.kubevirt.io kubevirt/virt-api True 12d
v1beta1.cdi.kubevirt.io Local True 12d
v1beta1.clone.kubevirt.io Local True 12d
v1beta1.export.kubevirt.io Local True 12d
v1beta1.forklift.cdi.kubevirt.io Local True 6d9h
v1beta1.forklift.konveyor.io Local True 6d9h
v1beta1.instancetype.kubevirt.io Local True 12d
v1beta1.metallb.io Local True 12d
v1beta1.metrics.k8s.io kube-system/rke2-metrics-server True 12d
v1beta1.nmstate.io Local True 12d
v1beta1.snapshot.kubevirt.io Local True 12d
v1beta1.upload.cdi.kubevirt.io cdi/cdi-api False (FailedDiscoveryCheck) 12d
v1beta2.metallb.io Local True 12d
v1beta3.flowcontrol.apiserver.k8s.io Local True 12d
v2.autoscaling Local True 12d
v2.operators.coreos.com Local True 12d
v2alpha1.velero.io Local True 11d
Anything?
Note: as I mentioned, I'm running on RKE2 + Kubevirt + CDI + Prometheus + Loki (Mono).
@cloudcafetech this confirms that there's an issue with kubevirt; the root cause isn't in netobserv. It sounds similar to what is described here: https://github.com/kubevirt/containerized-data-importer/issues/1295 . That would be something to check with kubevirt experts: why this API is unavailable, I can't tell. I'm working on a fix on the netobserv side so it doesn't fail when APIs have such issues; it should be in the next release.
(you might get more troubleshooting info by running kubectl get apiservice v1beta1.upload.cdi.kubevirt.io -oyaml and looking at the status)
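The kubectl command above is usually all you need, but for completeness here is a sketch of doing the same check programmatically with the kube-aggregator client, assuming a local kubeconfig at the default path; the Available condition carries the reason (e.g. FailedDiscoveryCheck) and a message pointing at the backing service.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	apiregv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
	aggregator "k8s.io/kube-aggregator/pkg/client/clientset_generated/clientset"
)

func main() {
	// Assumes ~/.kube/config is present and points at the cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := aggregator.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the APIService and print its Available condition, which is
	// the same information shown by `kubectl get apiservice ... -oyaml`.
	svc, err := client.ApiregistrationV1().APIServices().Get(
		context.TODO(), "v1beta1.upload.cdi.kubevirt.io", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range svc.Status.Conditions {
		if c.Type == apiregv1.Available {
			fmt.Printf("Available=%s reason=%s message=%s\n", c.Status, c.Reason, c.Message)
		}
	}
}
```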