
Failure when integrating with Consul instance with a high number of services and endpoints

Open bdecoste opened this issue 2 years ago • 5 comments

Gloo Edge Version

1.11.x (latest stable)

Kubernetes Version

No response

Describe the bug

Gloo throws an error when integrating with a Consul instance that has a high number of services (e.g. 5000) and endpoints (e.g. 75000):

{"level":"error","ts":"2022-06-17T23:21:39.701Z","logger":"gloo.v1.event_loop.setup.v1.event_loop.syncer.consul_eds","caller":"consul/eds.go:183","msg":"write error channel is full! could not propagate err: Get \"http://172.17.0.1:8500/v1/catalog/service/raas-xiaowei-name-test-redis?consistent=&dc=sitetest3\": context canceled","version":"1.12.0-beta17","stacktrace":"[github.com/solo-io/gloo/projects/gloo/pkg/plugins/consul.refreshSpecs](http://github.com/solo-io/gloo/projects/gloo/pkg/plugins/consul.refreshSpecs)\n\t/workspace/gloo/projects/gloo/pkg/plugins/consul/eds.go:183\[ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/consul.(*plugin).WatchEndpoints.func2](http://ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/consul.(*plugin).WatchEndpoints.func2)\n\t/workspace/gloo/projects/gloo/pkg/plugins/consul/eds.go:99"}

Steps to reproduce the bug

  1. Deploy Consul with a high number of services and endpoints
  2. Disable discovery
  3. Configure Gloo Edge for Consul-based EDS

Expected Behavior

Gloo Edge should handle processing a high number of services and endpoints without error.

Additional Context

No response

bdecoste avatar Jun 21 '22 20:06 bdecoste

I suspect updates off the service meta channel (https://github.com/solo-io/gloo/blob/master/projects/gloo/pkg/plugins/consul/eds.go#L92-L93) are clobbering ongoing requests that used the old context to build the specs for all Consul services (https://github.com/solo-io/gloo/blob/7da575b8b13b2371d0b6ec60285a11fb0d9ddf68/projects/gloo/pkg/plugins/consul/eds.go#L175).

we should probably allow this to act in a piecemeal fashion rather than as a full state-of-the-world rebuild. we may also want to explore filtering the requests used to build the specs
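
as a rough sketch of the suspected pattern (illustrative only, not the actual eds.go code; refreshAllSpecs and the channel shape are hypothetical stand-ins, assuming the standard github.com/hashicorp/consul/api client underneath):

```go
package consulsketch

import (
	"context"
	"log"

	consulapi "github.com/hashicorp/consul/api"
)

// Hypothetical reconstruction of the suspected pattern: each update on the
// service-meta channel cancels the previous context and rebuilds specs for
// *all* services, so catalog queries still in flight under the old context
// fail with "context canceled".
func watchEndpoints(parent context.Context, serviceMeta <-chan []string, client *consulapi.Client) {
	cancel := func() {}
	for {
		select {
		case services, ok := <-serviceMeta:
			if !ok {
				cancel()
				return
			}
			cancel() // clobbers any refresh still running for the previous update
			var ctx context.Context
			ctx, cancel = context.WithCancel(parent)
			go refreshAllSpecs(ctx, client, services) // state-of-the-world rebuild
		case <-parent.Done():
			cancel()
			return
		}
	}
}

// refreshAllSpecs is a stand-in for refreshSpecs: with thousands of services,
// one /v1/catalog/service/<name> call per service rarely finishes before the
// next meta update cancels ctx.
func refreshAllSpecs(ctx context.Context, client *consulapi.Client, services []string) {
	for _, name := range services {
		q := (&consulapi.QueryOptions{}).WithContext(ctx)
		if _, _, err := client.Catalog().Service(name, "", q); err != nil {
			log.Printf("could not propagate err: %v", err) // e.g. "context canceled"
			continue
		}
		// ...build the service spec from the returned catalog entries...
	}
}
```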

kdorosh avatar Jun 21 '22 20:06 kdorosh

we may also want to explore using caching / filtering options available here https://github.com/solo-io/gloo/blob/7da575b8b13b2371d0b6ec60285a11fb0d9ddf68/projects/gloo/pkg/upstreams/consul/consul_client.go#L115

note, the user was seeing context cancellation issues even with zero Consul upstreams defined. purely the size of their environment (the number of services) combined with our watch implementation is enough to reproduce the issue
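
for reference, a minimal sketch of the caching / filtering knobs the standard github.com/hashicorp/consul/api client exposes (the filter expression below is a hypothetical placeholder; whether and how gloo should set these is exactly what we'd need to explore):

```go
package consulsketch

import (
	"time"

	consulapi "github.com/hashicorp/consul/api"
)

// listServiceWithCache shows the agent-cache and server-side filtering options
// on QueryOptions; both reduce load compared to fully consistent bulk reads.
func listServiceWithCache(client *consulapi.Client, service, dc string) ([]*consulapi.CatalogService, error) {
	q := &consulapi.QueryOptions{
		Datacenter: dc,
		UseCache:   true,             // serve from the local agent cache when possible
		MaxAge:     30 * time.Second, // tolerate slightly stale cached results
		// Hypothetical filter expression; the real selector would depend on how
		// services are tagged/labelled for gloo.
		Filter: `ServiceMeta["gloo"] == "enabled"`,
	}
	entries, _, err := client.Catalog().Service(service, "", q)
	return entries, err
}
```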

kdorosh avatar Jun 21 '22 20:06 kdorosh

@chrisgaun and @nrjpoddar - this one is high priority. Can we get someone assigned?

willowmck avatar Aug 05 '22 14:08 willowmck

Is this higher priority compared to https://github.com/solo-io/gloo/issues/6815?

nrjpoddar avatar Aug 05 '22 14:08 nrjpoddar

looks like a dependency

willowmck avatar Aug 05 '22 15:08 willowmck

we will want to swap the endpoints query (catalog services in Consul lingo) to use blocking queries, which are supported

services already use them (and support them)

we may also want to update the services and catalog services queries to use their own context that doesn't get cancelled for these long polls (https://www.consul.io/api-docs/catalog#list-services)

The streaming backend, first introduced in Consul 1.10, is a replacement for the long polling backend. If streaming is supported by an endpoint, it will be used when either the index or cached query parameters are set.
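
a rough sketch of what a per-service blocking watch could look like with the standard Consul API client (function and channel names here are illustrative, not a proposed implementation): the query blocks on WaitIndex/WaitTime and runs on a long-lived context that is not cancelled on every service-meta update.

```go
package consulsketch

import (
	"context"
	"time"

	consulapi "github.com/hashicorp/consul/api"
)

// watchServiceEndpoints long-polls one service's catalog entries and pushes
// each change to out, instead of re-listing everything on every update.
func watchServiceEndpoints(longLived context.Context, client *consulapi.Client, service, dc string,
	out chan<- []*consulapi.CatalogService) error {
	var lastIndex uint64
	for {
		q := (&consulapi.QueryOptions{
			Datacenter: dc,
			WaitIndex:  lastIndex,        // block until the index changes...
			WaitTime:   10 * time.Minute, // ...or the wait time elapses
		}).WithContext(longLived)
		entries, meta, err := client.Catalog().Service(service, "", q)
		if err != nil {
			return err // real errors, or longLived being done
		}
		if meta.LastIndex == lastIndex {
			continue // wait timed out with no change; poll again
		}
		lastIndex = meta.LastIndex
		select {
		case out <- entries:
		case <-longLived.Done():
			return longLived.Err()
		}
	}
}
```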

if we can, this would fit more naturally with the assumption we already make in gloo (e.g. the endpoint warming timeout), where the initial list gets more time to complete than subsequent calls / watches. since the consul client today always does bulk lists, every upstream change triggers an entire state-of-the-world resync, which is slower than our hard-coded 1s resync period (we may also want to make that configurable). note: we may still want the initial 1s timeout to be configurable for the initial list, but we can then be more responsive once the long polling is set up
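
and a sketch of that timing split, assuming a hypothetical configurable initialListTimeout in place of today's hard-coded 1s: only the first bulk list is bounded, then each service hands off to a long poll like the watchServiceEndpoints sketch above.

```go
package consulsketch

import (
	"context"
	"time"

	consulapi "github.com/hashicorp/consul/api"
)

// startEndpointDiscovery bounds only the initial state-of-the-world list with a
// configurable timeout, then starts an unbounded blocking watch per service.
func startEndpointDiscovery(parent context.Context, client *consulapi.Client, dc string,
	initialListTimeout time.Duration, // hypothetical knob replacing the hard-coded 1s budget
	out chan<- []*consulapi.CatalogService) error {

	listCtx, cancel := context.WithTimeout(parent, initialListTimeout)
	defer cancel()

	// The only bounded call: list all service names once.
	services, _, err := client.Catalog().Services((&consulapi.QueryOptions{Datacenter: dc}).WithContext(listCtx))
	if err != nil {
		return err
	}

	// After the initial list, each service gets its own long poll (the
	// watchServiceEndpoints sketch above) on the parent context.
	for name := range services {
		go func(name string) {
			_ = watchServiceEndpoints(parent, client, name, dc, out)
		}(name)
	}
	return nil
}
```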

kdorosh avatar Aug 15 '22 21:08 kdorosh