Webhook pod (frr-k8s-webhook-server) restarts at least 3 times before becoming healthy
When running the E2E test that checks frr-k8s (https://github.com/metallb/metallb-operator/blob/main/test/e2e/functional/tests/e2e.go#L282), the test is green, but the pod restarts several times before it becomes healthy/ready:
kubectl -n metallb-system get pods -l component=frr-k8s-webhook-server -o wide -w
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
frr-k8s-webhook-server-6ffd7bc857-cwcsv 0/1 CrashLoopBackOff 2 (6s ago) 29s 10.244.2.7 kind-worker <none> <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv 0/1 Running 3 (22s ago) 45s 10.244.2.7 kind-worker <none> <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv 1/1 Running 3 (38s ago) 61s 10.244.2.7 kind-worker <none> <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv 1/1 Terminating 3 (69s ago) 92s 10.244.2.7 kind-worker <none> <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv 0/1 Terminating 3 (70s ago) 93s <none> kind-worker <none> <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv 0/1 Terminating 3 (70s ago) 93s 10.244.2.7 kind-worker <none> <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv 0/1 Terminating 3 (70s ago) 93s 10.244.2.7 kind-worker <none> <none>
frr-k8s-webhook-server-6ffd7bc857-cwcsv 0/1 Terminating 3 (70s ago) 93s 10.244.2.7 kind-worker <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 Pending 0 0s <none> <none> <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 Pending 0 0s <none> kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 ContainerCreating 0 0s <none> kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 Running 0 1s 10.244.1.5 kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 Completed 0 2s 10.244.1.5 kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 Running 1 (2s ago) 3s 10.244.1.5 kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 Error 1 (4s ago) 5s 10.244.1.5 kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 CrashLoopBackOff 1 (6s ago) 10s 10.244.1.5 kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 Running 2 (20s ago) 24s 10.244.1.5 kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 Completed 2 (21s ago) 25s 10.244.1.5 kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 CrashLoopBackOff 2 (2s ago) 26s 10.244.1.5 kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 0/1 Running 3 (33s ago) 57s 10.244.1.5 kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 1/1 Running 3 (46s ago) 70s 10.244.1.5 kind-worker2 <none> <none>
frr-k8s-webhook-server-6ffd7bc857-phvvf 1/1 Terminating 3 (78s ago) 102s 10.244.1.5 kind-worker2 <none> <none>
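For reference, a one-liner that surfaces the restart count the E2E test could assert on (same namespace and label as the watch above; just a sketch, not part of the existing test):

kubectl -n metallb-system get pods -l component=frr-k8s-webhook-server \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}'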
With the latest 4.16 MetalLB we get an ImagePullBackOff on the controller, speaker, and frr-k8s pods; the pods are only partially deployed, with 1/2 or 4/6 containers ready. It might be a related issue, but the outcome is worse.
oc get csv
NAME DISPLAY VERSION REPLACES PHASE
ingress-node-firewall.v4.16.0-202409051837 Ingress Node Firewall Operator 4.16.0-202409051837 ingress-node-firewall.v4.16.0-202410011135 Succeeded
metallb-operator.v4.16.0-202410292005 MetalLB Operator 4.16.0-202410292005 metallb-operator.v4.16.0-202410251707 Succeeded
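For the ImagePullBackOff, the concrete pull error is usually visible in the pod events; a sketch assuming the metallb-system namespace and the component labels used above:

# Recent failure events, newest last (namespace/labels are assumptions)
oc -n metallb-system get events --field-selector reason=Failed --sort-by=.lastTimestamp | tail
oc -n metallb-system describe pod -l component=frr-k8s | grep -A 10 'Events:'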
The webhook pod shows TLS handshake errors in its logs:
(*runnableGroup).reconcile.func1\n\t/metallb/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223"}
2024/10/30 05:54:02 http: TLS handshake error from 10.130.0.41:48190: remote error: tls: bad certificate
2024/10/30 05:54:03 http: TLS handshake error from 10.130.0.41:48200: remote error: tls: bad certificate
2024/10/30 05:54:05 http: TLS handshake error from 10.130.0.41:48206: remote error: tls: bad certificate
2024/10/30 05:54:05 http: TLS handshake error from 10.130.0.41:48208: remote error: tls: bad certificate
2024/10/30 05:54:06 http: TLS handshake error from 10.130.0.41:48218: remote error: tls: bad certificate
2024/10/30 05:54:08 http: TLS handshake error from 10.130.0.41:58904: remote error: tls: bad certificate
2024/10/30 05:54:08 http: TLS handshake error from 10.130.0.41:58916: remote error: tls: bad certificate
2024/10/30 05:54:09 http: TLS handshake error from 10.130.0.41:58918: remote error: tls: bad certificate
2024/10/30 05:54:11 http: TLS handshake error from 10.130.0.41:58928: remote error: tls: bad certificate
2024/10/30 05:54:14 http: TLS handshake error from 10.130.0.41:58940: remote error: tls: bad certificate
2024/10/30 05:54:15 http: TLS handshake error from 10.130.0.41:58954: remote error: tls: bad certificate
2024/10/30 05:54:17 http: TLS handshake error from 10.130.0.41:58964: remote error: tls: bad certificate
2024/10/30 05:54:17 http: TLS handshake error from 10.130.0.41:58978: remote error: tls: bad certificate
2024/10/30 05:54:18 http: TLS handshake error from 10.130.0.41:44000: remote error: tls: bad certificate
2024/10/30 05:54:20 http: TLS handshake error from 10.130.0.41:44014: remote error: tls: bad certificate
2024/10/30 05:54:23 http: TLS handshake error from 10.130.0.41:44024: remote error: tls: bad certificate
2024/10/30 05:54:24 http: TLS handshake error from 10.130.0.41:44038: remote error: tls: bad certificate
2024/10/30 05:54:26 http: TLS handshake error from 10.130.0.41:44052: remote error: tls: bad certificate
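The "bad certificate" here means the API server rejected the serving certificate the webhook presented, which usually happens while the caBundle in the webhook configuration and the serving-cert secret are out of sync (e.g. the pod started before the cert rotator updated them). A way to compare the two; the secret and webhook configuration names below are assumptions, adjust them to whatever the operator actually creates:

# Serving certificate mounted by the webhook pod (assumed secret name)
oc -n metallb-system get secret frr-k8s-webhook-server-cert -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -subject -enddate
# CA bundle the API server validates it against (assumed webhook configuration name)
oc get validatingwebhookconfiguration frr-k8s-validating-webhook-configuration \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -subject -enddate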
The hosting cluster lacks these services:
frr-k8s-monitor-service
frr-k8s-webhook-service
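A quick way to confirm which frr-k8s services exist on the hosting cluster (the namespace is an assumption):

oc -n metallb-system get svc | grep frr-k8s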
Hosted KubeVirt clusters fail to pull images and deploy operators, showing a DeadlineExceeded error.
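To surface which CSVs are stuck and why, something like the following (standard OLM status fields, nothing operator-specific):

# Phase and reason for every CSV; failing installs report their reason in .status
oc get csv -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,REASON:.status.reason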
Another cluster, running the latest 4.17 version, has the same ImagePullBackOff errors on controller, speaker, and frr-k8s, but it seems to be working as expected.
oc get csv
NAME DISPLAY VERSION REPLACES PHASE
ingress-node-firewall.v4.17.0-202410011205 Ingress Node Firewall Operator 4.17.0-202410011205 ingress-node-firewall.v4.17.0-202410211206 Succeeded
metallb-operator.v4.17.0-202410241236 MetalLB Operator 4.17.0-202410241236
@DanielOsypenko there's no 4.16 version here; this is the community version of the operator. If this is happening on OpenShift, I suggest following up through Red Hat channels.