Connection to Hubble during flow validation is unreliable (i.e. `Timeout waiting for flow listener to become ready`)
In multiple tests, we've observed that the connection between the Cilium CLI and Hubble is unreliable. This typically looks as follows:
📄 Following flows...
[.] Action [no-policies/client-to-client/ping-1: cilium-test/client2-5998d566b4-xsbdg (10.84.0.195) -> cilium-test/client-6488dcf5d4-wbs57 (10.84.0.66:0)]
🟥 Timeout waiting for flow listener to become ready
[=] Test [allow-all-except-world]
🟥 Receiving flows from Hubble Relay: hubble server status failure: context canceled
It's not clear why the flow listener times out. There are multiple potential explanations which may apply:
- The port-forward between the Cilium CLI and Hubble Relay is unreliable. This has been reported previously on AKS (for any port-forward). For CI purposes, we could potentially rely on a service of type LoadBalancer instead, which might be more stable. There are authentication concerns, however, since anyone could access a service of type LoadBalancer.
- The connection between Hubble Relay and the Hubble Observers (inside cilium-agent) is unreliable. This could indicate a pod2host connectivity issue, which needs to be investigated because it would likely affect more than just Hubble.
- No reconnection attempt is made. The Cilium CLI should ideally try to reconnect to deal with ephemeral connectivity issues between it and the cluster.
Related: https://github.com/cilium/cilium-cli/issues/1202
We may wish to consider passing `--debug` to `cilium connectivity test` in all the CI tests.
I've observed this as well when running the cilium connectivity test command from my laptop to clusters running in EKS.
We've added retries when connecting to Hubble Relay, but it still fails sometimes:
2023-06-06T14:31:20.206828192Z 🐛 hubble relay server status failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4245: connect: connection refused"
2023-06-06T14:31:21.207331072Z 🐛 retrying hubble relay server status request
2023-06-06T14:31:21.208022384Z 🐛 hubble relay server status failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4245: connect: connection refused"
2023-06-06T14:31:22.208211059Z 🐛 retrying hubble relay server status request
2023-06-06T14:31:22.208255260Z 🐛 hubble relay server status failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4245: connect: connection refused"
2023-06-06T14:31:23.195803817Z 🟥 Timeout waiting for flow listener to become ready
2023-06-06T14:31:23.195909718Z 🟥 Receiving flows from Hubble Relay: hubble server status failure: context canceled
https://github.com/cilium/cilium-cli/actions/runs/5188629513/jobs/9354654785
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This issue has not seen any activity since it was marked stale. Closing.