envoy
envoy copied to clipboard
Optimize EDS stream allocation
Signed-off-by: Kateryna Nezdolii [email protected]
Commit Message: Optimize streams allocation for Eds Grpc Additional Description: With this change Eds Grpc subscriptions towards the same management server reuse a shared grpc connection, which greatly reduces the number of concurrent H2 streams towards a single management server. The work is built on top of 5026, with bugs fixed and an addition of integration test for Eds over Grpc. I could not port 3 existing Eds integration tests for warming clusters, as they were blocked by non graceful shutdown of GoogleAsyncClient.
The change has also been tested in production Spotify environment, we use Eds Grpc. Envoy successfully resolved cluster load assignments via optimized Eds.
Risk Level: High Testing: Unit tests + integration tests Docs Changes: TBD Release Notes: TBD Platform Specific Features: Fixes #2943
cc @htuch we have chatted about this change long time ago prior to my parental leave.
Thanks so much @nezdolik, this is a definitely long asked for feature from the community and will be a welcome contribution. Assigning @adisuissa for review, since Adi is now leading work around xDS architecture in Envoy.
/wait on CI fixes
Apologies for the late reply. Thanks for working on this, and thanks for trying this in prod and giving us the feedback! Overall I think we would like to have this support for any xDS service, not only for EDS.
I'll start by giving some background on where I think we should be heading, and then answer your questions.
In the future, as we move towards the new xDS-TP, the subscriptions to the same authority will essentially be multiplexed using this feature. Your optimized EDS stream allocation will work for each ADS subscription to a specific authority. For now, we have a single gRPC stream when ADS is used (which doesn't need this multiplexing), and for non-ADS we have per-subscription stream.
IIUC the current design is: The cluster manager has a single EdsSubscriptionFactory which maps between config and a gRPC stream.
In the future, we will probably have a ConfigSubscriptionsManager in the server, that will own all the subscriptions, and a map between a config-source (and authority) to its actual GrpcMux. The GrpcMux can be configured to either SotW or Delta, and will abstract over it (just as is being done today).
For this PR I've got a few high-level comments to start with:
- Should the EdsSubscriptionFactory extend SubscriptionFactoryImpl, and just have a method to return the correct GrpcMux (overriding the current create methods, and calling them if the subscription isn't there already)?
- Can we use a different name for this factory, maybe
MultiplexedSubscriptionFactory? (other ideas to make it more generic are welcome :) ) - Can we add some integration tests?
- For now I do agree that this factory should co-exist with the current (non-multiplexed) subscription factory, but we should gradually move away from using the original one. I think that introducing (and guarding) it for each xDS-type is the right way forward.
Addressing your questions:
Do we need support for Delta Grpc APi type? Yes, we should support both SotW and Delta, but I don't think it will add too much to this implementation.
Do we need support for unified mux? The unified-mux code essentially uses the same code-base for both SotW and Delta implementations, so yes we would like to support it. In the near future we will move to that code-base and abandon the old one. The unified mux code should implement the GrpcMux interface, so I don't think it will require much work.
Implement collectionSubscriptionFromUrl method? (part of udpa/resource federation) Yes, it should be implemented, but I think it is very straight-forward.
Hide change behind feature flag in api? (as it is high risk) I think this is highly desired, at least temporarily, as we want to allow the use of the current subscription code if we break anyone.
LMK if there's something unclear, and if I can help with anything.
@adisuissa thanks for a very detailed feedback and apologies for late reply, I have been stuck in figuring out mocks fix for the remaining Eds tests.
I agree that MultiplexedSubscriptionFactory should be a general feature that could be enabled for any XDS type (since RDS and SDS currently expose the same perf problem for a large number of resources).We could start by implementing a generic MultiplexedSubscriptionFactory and exposing it as a config option in EDS api as a start, does that sound good?
I will start addressing your feedback in a series of commits and will ping you when all parts are done.
May need some help later with the integration test, I was not completely sure on which signal/stats to rely to verify that grpc channel is being multiplexed in the test(may figure it out later in the process). The original idea was to look at Envoy stats for successfully completed requests for an H2 Eds cluster, since in H2 each stream maps to a single logical request/response.
I agree that MultiplexedSubscriptionFactory should be a general feature that could be enabled for any XDS type (since RDS and SDS currently expose the same perf problem for a large number of resources).We could start by implementing a generic MultiplexedSubscriptionFactory and exposing it as a config option in EDS api as a start, does that sound good?
Sounds good to me, thanks for all the efforts!
May need some help later with the integration test, I was not completely sure on which signal/stats to rely to verify that grpc channel is being multiplexed in the test(may figure it out later in the process). The original idea was to look at Envoy stats for successfully completed requests for an H2 Eds cluster, since in H2 each stream maps to a single logical request/response.
I wonder if the upstream statistics can be used to count the number of connections. If you point me to the specific test, I can try to take a look.
/wait-any
Applied portion of review comments:
- EdsSubscriptionFactory renamed to MultiplexedSubscriptionFactory
- MultiplexedSubscriptionFactory extends SubscriptionFactoryImpl and overrides method for creating mux
- MultiplexedSubscriptionFactory shares mux for subscriptions per management server per xds type url.
- Updated MultiplexedSubscriptionFactory test
I will now work on integration test, add tests for new feature and make it parametrised for xds type (sotw, delta) and grpc flavour (envoy, google). @adisuissa I am not sure if upstream stats for number of connections is good enough signal to rely on. Since mux manages a single grpc stream (according to docs) and grpc channel may have many streams.
- Added integration test to verify that discovery requests for multiple EDS clusters reuse one stream/mux towards same management server.
Looks like there is merge conflict that needs to be fixed. Will work now on adding api feature switch in EDS config and extending tests for the config property.
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy[\w/]*/(v1alpha\d?|v1|v2alpha\d?|v2))|(api/envoy/type/(matcher/)?\w+.proto).
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @adisuissa
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).
/retest
Retrying Azure Pipelines: Check envoy-presubmit isn't fully completed, but will still attempt retrying. Retried failed jobs in: envoy-presubmit
@adisuissa this is ready for review
@adisuissa and @lizan this is ready for review
/wait
Sorry just catching up on this issue. I would like to get this fixed for the folks at Spotify. Are we good with this overall change if we call it a bug and fix it under runtime flag? That seems fine to me. cc @markdroth @nezdolik
Triggering this via a runtime flag instead of a new API field SGTM.
@mattklein123 @markdroth switched to using runtime feature flag for eds multiplexing.
@adisuissa could you give this another pass please
Thanks! I'm happy with this change in that it no longer includes any API changes. So at this point, I'll leave the actual review to @adisuissa.
@adisuissa this is ready for review (applied review comments, added test for stream per cluster, enabled multiplexing by default). Only piece not done is integration test for turning on/off the flag via runtime.
Thanks for working on this! Code LGTM, and I still need to go through the last integration test (sorry, it is quite large). RE: "Only piece not done is integration test for turning on/off the flag via runtime", I thought that
MultiplexedDeltaSotwIntegrationParamTest(whichEdsOverGrpcIntegrationTestextends) takes care of that, no?
It does yes, what i meant is there is no test that turns flag on and then off via runtime in a single test.
This bug fix is important for our company as it unblocks significant cost cut down for cross-zone traffic for all services. We intended to use the fix early Q1, hopefully we can get it merged by then and not use internal fork with patch included.
/retest
Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit
/assign-from @envoyproxy/senior-maintainers
@envoyproxy/senior-maintainers assignee is @lizan
This needs a main merge but in the meantime @lizan can you please take a look?
The patch has been successfully tested in Spotify prod environment, @lizan
friendly ping @lizan
I think this is waiting for your input.
@nezdolik this needs a main merge
/wait