grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

xds: add support for multiple xDS clients, for fallback

Open easwars opened this issue 1 year ago • 5 comments

This PR adds support for multiple xDS clients, as required by xDS fallback(A71).

Summary of changes:

  • xDS client creation APIs now require a name to be specified
    • On the client side, the user's dial target is used, and on the server side a well-known dedicated key #server is used
    • Tests pass t.Name as the xDS client name
    • Also, added a GetForTesting method since it is no longer possible to simply call New and expect to get a reference to the singleton xDS client (as used by some tests to trigger resource not found errors)
    • Changes to the following resolvers to pass user's dial target during xDS client creation
      • google_c2p
      • xds
  • xDS client changes to dumping resources
    • Removed the DumpResources method from the XDSClient interface
    • Added a DumpResources function to the xdsclient package that traverses all existing xDS clients, retrieves a dump from each of them, and adds the client_scope field to each one, and returns a response that contains resources from all the clients
  • CSDS service implementation changes
    • No longer holds a reference to the xDS singleton. Instead invokes the DumpResources API provided by the xdsclient package (that retrieves configs from all xDS clients in the system).
    • E2E test now creates more than one xDS client and verifies that the client_scope field is set appropriately.
  • Minor improvements to stringing functionality in the bootstrap package to improve test log readability.

Fixes https://github.com/grpc/grpc-go/issues/6899

#a71-xds-fallback

RELEASE NOTES: none

easwars avatar Jun 24 '24 19:06 easwars

I probably need to add one more test where the xDS clients are talking to different management servers, and receiving different resources. But the review can start prior to that, and I should have that test written soonish.

easwars avatar Jun 24 '24 19:06 easwars

TestDumpResources is failing because the test context is expiring before we test can read the expected config dump from the xDS clients. My guess at this point is that because the go-control-plane keeps resending resources when they are NACKed, the gRPC client is overwhelmed by ADS responses. But I have to confirm this theory. Will update when I have a breakthrough.

easwars avatar Jun 24 '24 20:06 easwars

Codecov Report

Attention: Patch coverage is 80.21978% with 18 lines in your changes missing coverage. Please review.

Project coverage is 81.32%. Comparing base (4dd7f55) to head (ef92a2f). Report is 22 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7347      +/-   ##
==========================================
+ Coverage   80.58%   81.32%   +0.73%     
==========================================
  Files         349      348       -1     
  Lines       34056    26744    -7312     
==========================================
- Hits        27445    21749    -5696     
+ Misses       5431     3795    -1636     
- Partials     1180     1200      +20     
Files Coverage Δ
xds/csds/csds.go 69.23% <100.00%> (-3.50%) :arrow_down:
xds/internal/resolver/xds_resolver.go 80.71% <100.00%> (+1.78%) :arrow_up:
xds/internal/xdsclient/authority.go 86.09% <100.00%> (-2.85%) :arrow_down:
xds/internal/xdsclient/clientimpl_dump.go 100.00% <100.00%> (+15.00%) :arrow_up:
xds/server.go 81.74% <100.00%> (-1.00%) :arrow_down:
xds/internal/xdsclient/clientimpl_loadreport.go 69.23% <0.00%> (-4.11%) :arrow_down:
internal/testutils/xds/e2e/logging.go 50.00% <50.00%> (ø)
xds/googledirectpath/googlec2p.go 87.50% <33.33%> (+1.78%) :arrow_up:
xds/internal/xdsclient/client_refcounted.go 83.72% <92.30%> (ø)
xds/internal/xdsclient/client_new.go 86.04% <75.00%> (+1.76%) :arrow_up:
... and 1 more

... and 292 files with indirect coverage changes

codecov[bot] avatar Jun 24 '24 21:06 codecov[bot]

Tests are passing now. This is good to be looked at.

easwars avatar Jun 25 '24 15:06 easwars

Looks like there is a flake in the csds e2e test. I will continue to look at that, but this is still good for a second pass. Thanks.

easwars avatar Jun 27 '24 23:06 easwars

The CSDS test is failing on Github Actions because of the way go-control-plane continuously resends NACKed resources. This results in a busy loop on the xdsClient where it receives a resource and NACKs it repeatedly. This results in starvation of the DumpResources call from the CSDS service.

I'm planning to put this PR on hold for now, fix https://github.com/grpc/grpc/issues/34099 and get back here.

easwars avatar Jul 01 '24 16:07 easwars

@zasweq : Based on our offline discussion, I made a commit that skips the test on arm64. Waiting for GA to pass before I merge this. That way, I can work on the other PR implementing ADS stream flow control, and once I have that, I will be able to tweak this test and get it to work.

easwars avatar Jul 02 '24 22:07 easwars

Sgtm, thinking about it I agree with Doug. To minimize the chance of merge conflicts probably best to get this in sooner rather than later.

zasweq avatar Jul 02 '24 22:07 zasweq