Failure when integrating with Consul instance with a high number of services and endpoints
Gloo Edge Version
1.11.x (latest stable)
Kubernetes Version
No response
Describe the bug
Gloo throws an error when integrating with a Consul instance that has a high number of services (e.g. 5,000) and endpoints (e.g. 75,000):
{"level":"error","ts":"2022-06-17T23:21:39.701Z","logger":"gloo.v1.event_loop.setup.v1.event_loop.syncer.consul_eds","caller":"consul/eds.go:183","msg":"write error channel is full! could not propagate err: Get \"http://172.17.0.1:8500/v1/catalog/service/raas-xiaowei-name-test-redis?consistent=&dc=sitetest3\": context canceled","version":"1.12.0-beta17","stacktrace":"[github.com/solo-io/gloo/projects/gloo/pkg/plugins/consul.refreshSpecs](http://github.com/solo-io/gloo/projects/gloo/pkg/plugins/consul.refreshSpecs)\n\t/workspace/gloo/projects/gloo/pkg/plugins/consul/eds.go:183\[ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/consul.(*plugin).WatchEndpoints.func2](http://ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/consul.(*plugin).WatchEndpoints.func2)\n\t/workspace/gloo/projects/gloo/pkg/plugins/consul/eds.go:99"}
Steps to reproduce the bug
- Deploy Consul with high number of services and endpoints
- Disable discovery
- Configure Gloo Edge for Consul-based EDS
Expected Behavior
Handle a high number of services and endpoints without errors
Additional Context
No response
I suspect updates coming off the service meta channel (https://github.com/solo-io/gloo/blob/master/projects/gloo/pkg/plugins/consul/eds.go#L92-L93) are clobbering ongoing requests that were using the old context to build the specs for all Consul services (https://github.com/solo-io/gloo/blob/7da575b8b13b2371d0b6ec60285a11fb0d9ddf68/projects/gloo/pkg/plugins/consul/eds.go#L175)
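To make the suspected failure mode concrete, here is a minimal sketch of the cancel-on-update pattern I'm describing (a simplification under my reading of the code, not the actual eds.go logic; `watchServiceMeta` and `fetchSpec` are illustrative names):

```go
package main

import "context"

// watchServiceMeta sketches the suspected pattern: every update on the
// service-meta channel cancels the previous refresh, so with thousands of
// services the per-service catalog requests rarely finish before the next
// update kills them, surfacing "context canceled" errors.
func watchServiceMeta(ctx context.Context, serviceMeta <-chan []string,
	fetchSpec func(ctx context.Context, svc string) error) {

	var cancelPrev context.CancelFunc
	for services := range serviceMeta {
		if cancelPrev != nil {
			cancelPrev() // clobbers all in-flight per-service requests from the last update
		}
		refreshCtx, cancel := context.WithCancel(ctx)
		cancelPrev = cancel

		go func(services []string) {
			for _, svc := range services {
				if err := fetchSpec(refreshCtx, svc); err != nil {
					// errors are funneled into a bounded channel; once it fills up we log
					// "write error channel is full! could not propagate err: ..."
					continue
				}
			}
		}(services)
	}
}
```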
We should probably allow this to act in a piecemeal fashion rather than as a state-of-the-world approach. We may also want to explore filtering the requests used to build the specs.
we may also want to explore using caching / filtering options available here https://github.com/solo-io/gloo/blob/7da575b8b13b2371d0b6ec60285a11fb0d9ddf68/projects/gloo/pkg/upstreams/consul/consul_client.go#L115
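Something like the following is the kind of knob I have in mind (a sketch only: `listServiceEndpoints` is a hypothetical helper, the option values and the filter expression are illustrative, and the field names come from `github.com/hashicorp/consul/api.QueryOptions`):

```go
package main

import (
	"context"
	"time"

	"github.com/hashicorp/consul/api"
)

// listServiceEndpoints shows the kind of caching / filtering query options we
// could pass through the Gloo Consul client wrapper for per-service lookups.
func listServiceEndpoints(ctx context.Context, client *api.Client, name string) ([]*api.CatalogService, error) {
	q := &api.QueryOptions{
		AllowStale:   true,             // don't force every read through the leader
		UseCache:     true,             // use the local agent's caching layer
		MaxAge:       30 * time.Second, // acceptable staleness for cached results
		StaleIfError: time.Minute,      // serve stale data rather than erroring out
		// Server-side filtering cuts down the payload; expression is illustrative.
		Filter: `"gloo" in ServiceTags`,
	}
	svcs, _, err := client.Catalog().Service(name, "", q.WithContext(ctx))
	return svcs, err
}
```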
Note: the user was seeing context-cancellation issues even with zero Consul upstreams defined. The size of their environment (the number of services) plus our watch implementation is enough to reproduce the issue on its own.
@chrisgaun and @nrjpoddar - this one is high priority. Can we get someone assigned?
Is this higher priority than https://github.com/solo-io/gloo/issues/6815 ?
looks like a dependency
We will want to swap the endpoints query (catalog services, in Consul lingo) to use blocking queries, which are supported.
The services query already uses them (and Consul supports them there).
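Roughly what I have in mind for the endpoints side (a sketch; `watchServiceEndpoints` is a hypothetical helper and the wait/backoff values are illustrative). Setting `WaitIndex` turns the catalog query into a long poll that only returns when the service's endpoints change or `WaitTime` elapses:

```go
package main

import (
	"context"
	"time"

	"github.com/hashicorp/consul/api"
)

// watchServiceEndpoints runs a per-service blocking query loop and pushes each
// changed endpoint set onto the out channel.
func watchServiceEndpoints(ctx context.Context, client *api.Client, name string, out chan<- []*api.CatalogService) {
	var lastIndex uint64
	for ctx.Err() == nil {
		q := &api.QueryOptions{
			WaitIndex: lastIndex,       // block until X-Consul-Index moves past this value
			WaitTime:  5 * time.Minute, // upper bound on a single long poll
		}
		svcs, meta, err := client.Catalog().Service(name, "", q.WithContext(ctx))
		if err != nil {
			time.Sleep(time.Second) // simple backoff; real code should be smarter
			continue
		}
		if meta.LastIndex == lastIndex {
			continue // long poll timed out with no change
		}
		lastIndex = meta.LastIndex
		out <- svcs
	}
}
```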
we may also want to update the services and catalog-services queries to use their own context that doesn't get cancelled for these long polls (https://www.consul.io/api-docs/catalog#list-services):
> The streaming backend, first introduced in Consul 1.10, is a replacement for the long polling backend. If streaming is supported by an endpoint, it will be used when either the index or cached query parameters are set.
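A sketch of that context split (reusing `watchServiceEndpoints` from the sketch above; `processUpdate` is illustrative, not an existing Gloo function). The long polls live on a context tied to the watch lifetime, while each resync still gets its own short-lived, cancellable context; per the quoted docs, once the index parameter is set Consul 1.10+ may serve these via the streaming backend automatically:

```go
package main

import (
	"context"
	"time"

	"github.com/hashicorp/consul/api"
)

// runEndpointWatches separates the long-poll lifetime from the per-resync lifetime,
// so cancelling a resync no longer tears down the watches themselves.
func runEndpointWatches(rootCtx context.Context, client *api.Client, service string,
	processUpdate func(ctx context.Context, svcs []*api.CatalogService)) {

	watchCtx, stopWatches := context.WithCancel(rootCtx) // cancelled only on shutdown
	defer stopWatches()

	updates := make(chan []*api.CatalogService)
	go watchServiceEndpoints(watchCtx, client, service, updates)

	for svcs := range updates {
		// per-resync work can be cancelled or superseded without killing the long poll
		resyncCtx, cancel := context.WithTimeout(watchCtx, time.Second)
		processUpdate(resyncCtx, svcs)
		cancel()
	}
}
```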
If we can, this would fit more naturally with the assumption we already make in Gloo (e.g. the endpoint warming timeout), where the initial list has more time to complete than subsequent calls/watches. Since the Consul client today always does bulk lists, every upstream change triggers an entire state-of-the-world resync, which is slower than our hard-coded 1s resync period (which we may also want to make configurable). Note: we may still want the initial 1s timeout to be configurable for the initial list, but we can then be more responsive once the long polling is set up.
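A sketch of that two-phase timing (assumption: `syncEndpoints`, the configurable timeout, and `publish` are illustrative, not existing Gloo APIs):

```go
package main

import (
	"context"
	"time"

	"github.com/hashicorp/consul/api"
)

// syncEndpoints does one bounded bulk list so endpoints can warm up, then hands
// off to per-service blocking queries so we stop doing timer-driven
// state-of-the-world resyncs.
func syncEndpoints(ctx context.Context, client *api.Client, services []string,
	initialListTimeout time.Duration, publish func([]*api.CatalogService)) {

	// Phase 1: initial list under a configurable timeout, analogous to the
	// existing endpoint warming timeout.
	initialCtx, cancel := context.WithTimeout(ctx, initialListTimeout)
	defer cancel()

	var all []*api.CatalogService
	for _, name := range services {
		svcs, _, err := client.Catalog().Service(name, "", (&api.QueryOptions{}).WithContext(initialCtx))
		if err != nil {
			continue // one slow or failed service shouldn't sink the whole warm-up
		}
		all = append(all, svcs...)
	}
	publish(all) // initial state-of-the-world snapshot

	// Phase 2: switch to per-service blocking queries (see the earlier sketch) so
	// subsequent changes are pushed to us within the long-poll window instead of
	// waiting on the hard-coded 1s resync.
}
```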