envoy icon indicating copy to clipboard operation
envoy copied to clipboard

Optimize EDS stream allocation

Open nezdolik opened this issue 3 years ago • 13 comments
trafficstars

Signed-off-by: Kateryna Nezdolii [email protected]

Commit Message: Optimize streams allocation for Eds Grpc Additional Description: With this change Eds Grpc subscriptions towards the same management server reuse a shared grpc connection, which greatly reduces the number of concurrent H2 streams towards a single management server. The work is built on top of 5026, with bugs fixed and an addition of integration test for Eds over Grpc. I could not port 3 existing Eds integration tests for warming clusters, as they were blocked by non graceful shutdown of GoogleAsyncClient.

The change has also been tested in production Spotify environment, we use Eds Grpc. Envoy successfully resolved cluster load assignments via optimized Eds.

Risk Level: High Testing: Unit tests + integration tests Docs Changes: TBD Release Notes: TBD Platform Specific Features: Fixes #2943

nezdolik avatar Jul 26 '22 22:07 nezdolik

cc @htuch we have chatted about this change long time ago prior to my parental leave.

nezdolik avatar Jul 26 '22 22:07 nezdolik

Thanks so much @nezdolik, this is a definitely long asked for feature from the community and will be a welcome contribution. Assigning @adisuissa for review, since Adi is now leading work around xDS architecture in Envoy.

htuch avatar Jul 27 '22 04:07 htuch

/wait on CI fixes

alyssawilk avatar Aug 01 '22 14:08 alyssawilk

Apologies for the late reply. Thanks for working on this, and thanks for trying this in prod and giving us the feedback! Overall I think we would like to have this support for any xDS service, not only for EDS.

I'll start by giving some background on where I think we should be heading, and then answer your questions.

In the future, as we move towards the new xDS-TP, the subscriptions to the same authority will essentially be multiplexed using this feature. Your optimized EDS stream allocation will work for each ADS subscription to a specific authority. For now, we have a single gRPC stream when ADS is used (which doesn't need this multiplexing), and for non-ADS we have per-subscription stream.

IIUC the current design is: The cluster manager has a single EdsSubscriptionFactory which maps between config and a gRPC stream. In the future, we will probably have a ConfigSubscriptionsManager in the server, that will own all the subscriptions, and a map between a config-source (and authority) to its actual GrpcMux. The GrpcMux can be configured to either SotW or Delta, and will abstract over it (just as is being done today).

For this PR I've got a few high-level comments to start with:

  1. Should the EdsSubscriptionFactory extend SubscriptionFactoryImpl, and just have a method to return the correct GrpcMux (overriding the current create methods, and calling them if the subscription isn't there already)?
  2. Can we use a different name for this factory, maybe MultiplexedSubscriptionFactory? (other ideas to make it more generic are welcome :) )
  3. Can we add some integration tests?
  4. For now I do agree that this factory should co-exist with the current (non-multiplexed) subscription factory, but we should gradually move away from using the original one. I think that introducing (and guarding) it for each xDS-type is the right way forward.

Addressing your questions:

Do we need support for Delta Grpc APi type? Yes, we should support both SotW and Delta, but I don't think it will add too much to this implementation.

Do we need support for unified mux? The unified-mux code essentially uses the same code-base for both SotW and Delta implementations, so yes we would like to support it. In the near future we will move to that code-base and abandon the old one. The unified mux code should implement the GrpcMux interface, so I don't think it will require much work.

Implement collectionSubscriptionFromUrl method? (part of udpa/resource federation) Yes, it should be implemented, but I think it is very straight-forward.

Hide change behind feature flag in api? (as it is high risk) I think this is highly desired, at least temporarily, as we want to allow the use of the current subscription code if we break anyone.

LMK if there's something unclear, and if I can help with anything.

adisuissa avatar Aug 03 '22 12:08 adisuissa

@adisuissa thanks for a very detailed feedback and apologies for late reply, I have been stuck in figuring out mocks fix for the remaining Eds tests.

I agree that MultiplexedSubscriptionFactory should be a general feature that could be enabled for any XDS type (since RDS and SDS currently expose the same perf problem for a large number of resources).We could start by implementing a generic MultiplexedSubscriptionFactory and exposing it as a config option in EDS api as a start, does that sound good?

I will start addressing your feedback in a series of commits and will ping you when all parts are done.

May need some help later with the integration test, I was not completely sure on which signal/stats to rely to verify that grpc channel is being multiplexed in the test(may figure it out later in the process). The original idea was to look at Envoy stats for successfully completed requests for an H2 Eds cluster, since in H2 each stream maps to a single logical request/response.

nezdolik avatar Aug 13 '22 22:08 nezdolik

I agree that MultiplexedSubscriptionFactory should be a general feature that could be enabled for any XDS type (since RDS and SDS currently expose the same perf problem for a large number of resources).We could start by implementing a generic MultiplexedSubscriptionFactory and exposing it as a config option in EDS api as a start, does that sound good?

Sounds good to me, thanks for all the efforts!

May need some help later with the integration test, I was not completely sure on which signal/stats to rely to verify that grpc channel is being multiplexed in the test(may figure it out later in the process). The original idea was to look at Envoy stats for successfully completed requests for an H2 Eds cluster, since in H2 each stream maps to a single logical request/response.

I wonder if the upstream statistics can be used to count the number of connections. If you point me to the specific test, I can try to take a look.

adisuissa avatar Aug 15 '22 09:08 adisuissa

/wait-any

adisuissa avatar Aug 17 '22 13:08 adisuissa

Applied portion of review comments:

  • EdsSubscriptionFactory renamed to MultiplexedSubscriptionFactory
  • MultiplexedSubscriptionFactory extends SubscriptionFactoryImpl and overrides method for creating mux
  • MultiplexedSubscriptionFactory shares mux for subscriptions per management server per xds type url.
  • Updated MultiplexedSubscriptionFactory test

I will now work on integration test, add tests for new feature and make it parametrised for xds type (sotw, delta) and grpc flavour (envoy, google). @adisuissa I am not sure if upstream stats for number of connections is good enough signal to rely on. Since mux manages a single grpc stream (according to docs) and grpc channel may have many streams.

nezdolik avatar Aug 19 '22 22:08 nezdolik

  • Added integration test to verify that discovery requests for multiple EDS clusters reuse one stream/mux towards same management server.

Looks like there is merge conflict that needs to be fixed. Will work now on adding api feature switch in EDS config and extending tests for the config property.

nezdolik avatar Sep 04 '22 22:09 nezdolik

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy[\w/]*/(v1alpha\d?|v1|v2alpha\d?|v2))|(api/envoy/type/(matcher/)?\w+.proto). CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/). envoyproxy/api-shepherds assignee is @adisuissa CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/22419 was synchronize by nezdolik.

see: more, trace.

/retest

nezdolik avatar Sep 19 '22 21:09 nezdolik

Retrying Azure Pipelines: Check envoy-presubmit isn't fully completed, but will still attempt retrying. Retried failed jobs in: envoy-presubmit

:cat:

Caused by: a https://github.com/envoyproxy/envoy/pull/22419#issuecomment-1251570686 was created by @nezdolik.

see: more, trace.

@adisuissa this is ready for review

nezdolik avatar Sep 19 '22 21:09 nezdolik

@adisuissa and @lizan this is ready for review

yanavlasov avatar Sep 27 '22 14:09 yanavlasov

/wait

RyanTheOptimist avatar Oct 10 '22 14:10 RyanTheOptimist

Sorry just catching up on this issue. I would like to get this fixed for the folks at Spotify. Are we good with this overall change if we call it a bug and fix it under runtime flag? That seems fine to me. cc @markdroth @nezdolik

mattklein123 avatar Oct 14 '22 15:10 mattklein123

Triggering this via a runtime flag instead of a new API field SGTM.

markdroth avatar Oct 14 '22 16:10 markdroth

@mattklein123 @markdroth switched to using runtime feature flag for eds multiplexing.

nezdolik avatar Nov 03 '22 11:11 nezdolik

@adisuissa could you give this another pass please

phlax avatar Nov 08 '22 08:11 phlax

Thanks! I'm happy with this change in that it no longer includes any API changes. So at this point, I'll leave the actual review to @adisuissa.

markdroth avatar Nov 08 '22 16:11 markdroth

@adisuissa this is ready for review (applied review comments, added test for stream per cluster, enabled multiplexing by default). Only piece not done is integration test for turning on/off the flag via runtime.

nezdolik avatar Nov 24 '22 22:11 nezdolik

Thanks for working on this! Code LGTM, and I still need to go through the last integration test (sorry, it is quite large). RE: "Only piece not done is integration test for turning on/off the flag via runtime", I thought that MultiplexedDeltaSotwIntegrationParamTest (which EdsOverGrpcIntegrationTest extends) takes care of that, no?

It does yes, what i meant is there is no test that turns flag on and then off via runtime in a single test.

nezdolik avatar Dec 01 '22 23:12 nezdolik

This bug fix is important for our company as it unblocks significant cost cut down for cross-zone traffic for all services. We intended to use the fix early Q1, hopefully we can get it merged by then and not use internal fork with patch included.

nezdolik avatar Dec 02 '22 09:12 nezdolik

/retest

adisuissa avatar Dec 05 '22 14:12 adisuissa

Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit

:cat:

Caused by: a https://github.com/envoyproxy/envoy/pull/22419#issuecomment-1337444395 was created by @adisuissa.

see: more, trace.

/assign-from @envoyproxy/senior-maintainers

adisuissa avatar Dec 05 '22 14:12 adisuissa

@envoyproxy/senior-maintainers assignee is @lizan

:cat:

Caused by: a https://github.com/envoyproxy/envoy/pull/22419#issuecomment-1337444891 was created by @adisuissa.

see: more, trace.

This needs a main merge but in the meantime @lizan can you please take a look?

alyssawilk avatar Dec 08 '22 17:12 alyssawilk

The patch has been successfully tested in Spotify prod environment, @lizan

nezdolik avatar Dec 12 '22 12:12 nezdolik

friendly ping @lizan

I think this is waiting for your input.

@nezdolik this needs a main merge

/wait

KBaichoo avatar Jan 10 '23 18:01 KBaichoo