xds: add support for multiple xDS clients, for fallback
This PR adds support for multiple xDS clients, as required by xDS fallback(A71).
Summary of changes:
- xDS client creation APIs now require a name to be specified
- On the client side, the user's dial target is used, and on the server side a well-known dedicated key
#serveris used - Tests pass
t.Nameas the xDS client name - Also, added a
GetForTestingmethod since it is no longer possible to simply callNewand expect to get a reference to the singleton xDS client (as used by some tests to trigger resource not found errors) - Changes to the following resolvers to pass user's dial target during xDS client creation
-
google_c2p -
xds
-
- On the client side, the user's dial target is used, and on the server side a well-known dedicated key
- xDS client changes to dumping resources
- Removed the
DumpResourcesmethod from theXDSClientinterface - Added a
DumpResourcesfunction to thexdsclientpackage that traverses all existing xDS clients, retrieves a dump from each of them, and adds theclient_scopefield to each one, and returns a response that contains resources from all the clients
- Removed the
- CSDS service implementation changes
- No longer holds a reference to the xDS singleton. Instead invokes the
DumpResourcesAPI provided by thexdsclientpackage (that retrieves configs from all xDS clients in the system). - E2E test now creates more than one xDS client and verifies that the
client_scopefield is set appropriately.
- No longer holds a reference to the xDS singleton. Instead invokes the
- Minor improvements to stringing functionality in the
bootstrappackage to improve test log readability.
Fixes https://github.com/grpc/grpc-go/issues/6899
#a71-xds-fallback
RELEASE NOTES: none
I probably need to add one more test where the xDS clients are talking to different management servers, and receiving different resources. But the review can start prior to that, and I should have that test written soonish.
TestDumpResources is failing because the test context is expiring before we test can read the expected config dump from the xDS clients. My guess at this point is that because the go-control-plane keeps resending resources when they are NACKed, the gRPC client is overwhelmed by ADS responses. But I have to confirm this theory. Will update when I have a breakthrough.
Codecov Report
Attention: Patch coverage is 80.21978% with 18 lines in your changes missing coverage. Please review.
Project coverage is 81.32%. Comparing base (
4dd7f55) to head (ef92a2f). Report is 22 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #7347 +/- ##
==========================================
+ Coverage 80.58% 81.32% +0.73%
==========================================
Files 349 348 -1
Lines 34056 26744 -7312
==========================================
- Hits 27445 21749 -5696
+ Misses 5431 3795 -1636
- Partials 1180 1200 +20
| Files | Coverage Δ | |
|---|---|---|
| xds/csds/csds.go | 69.23% <100.00%> (-3.50%) |
:arrow_down: |
| xds/internal/resolver/xds_resolver.go | 80.71% <100.00%> (+1.78%) |
:arrow_up: |
| xds/internal/xdsclient/authority.go | 86.09% <100.00%> (-2.85%) |
:arrow_down: |
| xds/internal/xdsclient/clientimpl_dump.go | 100.00% <100.00%> (+15.00%) |
:arrow_up: |
| xds/server.go | 81.74% <100.00%> (-1.00%) |
:arrow_down: |
| xds/internal/xdsclient/clientimpl_loadreport.go | 69.23% <0.00%> (-4.11%) |
:arrow_down: |
| internal/testutils/xds/e2e/logging.go | 50.00% <50.00%> (ø) |
|
| xds/googledirectpath/googlec2p.go | 87.50% <33.33%> (+1.78%) |
:arrow_up: |
| xds/internal/xdsclient/client_refcounted.go | 83.72% <92.30%> (ø) |
|
| xds/internal/xdsclient/client_new.go | 86.04% <75.00%> (+1.76%) |
:arrow_up: |
| ... and 1 more |
Tests are passing now. This is good to be looked at.
Looks like there is a flake in the csds e2e test. I will continue to look at that, but this is still good for a second pass. Thanks.
The CSDS test is failing on Github Actions because of the way go-control-plane continuously resends NACKed resources. This results in a busy loop on the xdsClient where it receives a resource and NACKs it repeatedly. This results in starvation of the DumpResources call from the CSDS service.
I'm planning to put this PR on hold for now, fix https://github.com/grpc/grpc/issues/34099 and get back here.
@zasweq : Based on our offline discussion, I made a commit that skips the test on arm64. Waiting for GA to pass before I merge this. That way, I can work on the other PR implementing ADS stream flow control, and once I have that, I will be able to tweak this test and get it to work.
Sgtm, thinking about it I agree with Doug. To minimize the chance of merge conflicts probably best to get this in sooner rather than later.