
Increased timeouts for proxies check in glooctl

Open tkukushkin opened this issue 6 months ago • 2 comments

Description

  • Increased the number of attempts for ProxyEndpointRequest when glooctl check checks proxies.
  • Increased the timeout for fetching metrics from proxies when glooctl check checks proxies.

Context

In our environment we run glooctl check every 5 minutes, and we often get connection errors such as:

* rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [::1]:51350: connect: connection refused"
...
* timed out trying to connect to localhost during port-forward, errors: 8 errors occurred:
* Get "http://localhost:51666/stats/prometheus": dial tcp [::1]:51666: connect: connection refused
...

About the first error: while debugging glooctl check locally, I found that the port-forward sometimes takes more than 1 second to actually start working, but the request to gloo is made immediately after the port-forward is started. The request has 5 retries, which is sometimes not enough in our environment. Because glooctl check makes this request for every watched namespace (we have many) and a new port-forward is created for each request, the chance of a failure increases.

About the second error: we get it less often, maybe once a day, but it is still annoying. We have around 240,000 metrics on our proxies, and the 30-second timeout is probably not always enough.

Testing steps

I manually tested glooctl check in my environment. Everything works fine.

Checklist:

  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] I have added tests that prove my fix is effective or that my feature works

tkukushkin, Aug 28 '24 14:08