Connection to Hubble during flow validation is unreliable (i.e. `Timeout waiting for flow listener to become ready`)

Open gandro opened this issue 3 years ago • 3 comments

In multiple tests, we've observed that the connection between the Cilium CLI and Hubble is unreliable. This typically looks as follows:

 📄 Following flows...
  [.] Action [no-policies/client-to-client/ping-1: cilium-test/client2-5998d566b4-xsbdg (10.84.0.195) -> cilium-test/client-6488dcf5d4-wbs57 (10.84.0.66:0)]
  🟥 Timeout waiting for flow listener to become ready

[=] Test [allow-all-except-world]
  🟥 Receiving flows from Hubble Relay: hubble server status failure: context canceled

It's not clear why the flow listener times out. There are multiple potential explanations which may apply:

  • The port-forward between the Cilium CLI and Hubble Relay is unreliable. This has been reported previously on AKS (for any port forward). For CI purposes, we could potentially rely on a service type LoadBalancer instead, which might be more stable. There are authentication concerns however, since anyone could access type LoadBalancer services.
  • The connection between Hubble Relay and the Hubble Observers (inside cilium-agent) is unreliable. This could indicate a pod-to-host connectivity issue, which needs to be investigated because it would likely affect more than just Hubble.
  • No reconnection attempt is made. The Cilium CLI should ideally try to reconnect to deal with ephemeral connectivity issues between it and the cluster.
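
To make the third point concrete, here is a minimal sketch of what a reconnection loop with exponential backoff could look like. It is not the Cilium CLI's actual code: `retryWithBackoff`, `simulateDial`, and `errRefused` are hypothetical names invented for illustration, and the operation being retried is a stand-in for the real status request to Hubble Relay.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errRefused is a hypothetical stand-in for a transient dial failure.
var errRefused = errors.New("connection refused")

// retryWithBackoff calls op until it succeeds or ctx is done. The wait
// between attempts doubles after every failure, capped at maxDelay.
func retryWithBackoff(ctx context.Context, initial, maxDelay time.Duration, op func(context.Context) error) error {
	delay := initial
	for {
		err := op(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Report the last operation error too, so the cause is not
			// hidden behind a bare "context canceled".
			return fmt.Errorf("giving up: %w (last error: %v)", ctx.Err(), err)
		case <-time.After(delay):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

// simulateDial is a hypothetical helper: the operation fails `failures`
// times and then succeeds. It returns how many attempts were made.
func simulateDial(failures, timeoutMs int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	attempts := 0
	err := retryWithBackoff(ctx, time.Millisecond, 8*time.Millisecond, func(context.Context) error {
		attempts++
		if attempts <= failures {
			return errRefused
		}
		return nil
	})
	return attempts, err
}

func main() {
	attempts, err := simulateDial(3, 1000)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

The overall deadline comes from the context, so the caller decides how long an ephemeral outage may last before the test is failed. Wrapping the last operation error alongside `ctx.Err()` is a design choice that keeps the underlying dial failure visible in the final message.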

Related https://github.com/cilium/cilium-cli/issues/1202

gandro avatar Nov 09 '22 17:11 gandro

We may wish to consider passing --debug to cilium connectivity in all the CI tests.

squeed avatar Nov 16 '22 22:11 squeed

I've observed this as well when running the cilium connectivity test command from my laptop to clusters running in EKS.

soggiest avatar Nov 23 '22 00:11 soggiest

We've added retries when connecting to Hubble Relay, but it still fails sometimes:

2023-06-06T14:31:20.206828192Z   🐛 hubble relay server status failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4245: connect: connection refused"
2023-06-06T14:31:21.207331072Z   🐛 retrying hubble relay server status request
2023-06-06T14:31:21.208022384Z   🐛 hubble relay server status failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4245: connect: connection refused"
2023-06-06T14:31:22.208211059Z   🐛 retrying hubble relay server status request
2023-06-06T14:31:22.208255260Z   🐛 hubble relay server status failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4245: connect: connection refused"
2023-06-06T14:31:23.195803817Z   🟥 Timeout waiting for flow listener to become ready
2023-06-06T14:31:23.195909718Z   🟥 Receiving flows from Hubble Relay: hubble server status failure: context canceled

https://github.com/cilium/cilium-cli/actions/runs/5188629513/jobs/9354654785

michi-covalent avatar Jun 06 '23 14:06 michi-covalent

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] avatar Sep 28 '24 02:09 github-actions[bot]

This issue has not seen any activity since it was marked stale. Closing.

github-actions[bot] avatar Oct 14 '24 02:10 github-actions[bot]