grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

xdsclient: update watcher API as per gRFC A88

Open purnesh42H opened this issue 1 year ago • 14 comments

This is the first part of implementing gRFC A88 (https://github.com/grpc/proposal/pull/466).

This introduces the new watcher API but does not change any of the existing behavior. This table summarizes the API changes and behavior for each case:

Case Old API New API Behavior
Resource timer fires OnResourceDoesNotExist() OnResourceChanged(NOT_FOUND) Fail data plane RPCs
LDS or CDS resource deletion OnResourceDoesNotExist() OnResourceChanged(NOT_FOUND) Drop resource and fail data plane RPCs
xDS channel reports TRANSIENT_FAILURE OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
ADS stream terminates without receiving a response OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
Invalid resource update (client NACK) OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
Valid resource update OnUpdate(resource) OnResourceChanged(resource) use the new resource

RELEASE NOTES:

  • xds: TBD

purnesh42H avatar Jan 02 '25 21:01 purnesh42H

Codecov Report

Attention: Patch coverage is 80.20305% with 39 lines in your changes missing coverage. Please review.

Project coverage is 82.21%. Comparing base (ce35fd4) to head (c43cd62). Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
xds/internal/server/listener_wrapper.go 68.88% 12 Missing and 2 partials :warning:
xds/internal/resolver/xds_resolver.go 47.82% 10 Missing and 2 partials :warning:
xds/internal/server/rds_handler.go 61.11% 7 Missing :warning:
.../balancer/clusterresolver/resource_resolver_eds.go 50.00% 3 Missing and 1 partial :warning:
xds/internal/testutils/resource_watcher.go 83.33% 2 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7977      +/-   ##
==========================================
+ Coverage   82.16%   82.21%   +0.05%     
==========================================
  Files         410      412       +2     
  Lines       40248    40487     +239     
==========================================
+ Hits        33068    33287     +219     
- Misses       5830     5839       +9     
- Partials     1350     1361      +11     
Files with missing lines Coverage Δ
internal/xds/bootstrap/bootstrap.go 65.72% <ø> (ø)
xds/internal/balancer/cdsbalancer/cdsbalancer.go 84.49% <100.00%> (-0.15%) :arrow_down:
...s/internal/balancer/cdsbalancer/cluster_watcher.go 100.00% <100.00%> (ø)
...rnal/balancer/clusterresolver/resource_resolver.go 94.44% <100.00%> (ø)
xds/internal/resolver/serviceconfig.go 85.71% <100.00%> (ø)
xds/internal/resolver/watch_service.go 100.00% <100.00%> (+8.57%) :arrow_up:
xds/internal/xdsclient/authority.go 80.38% <100.00%> (+0.60%) :arrow_up:
xds/internal/xdsclient/clientimpl_watchers.go 84.81% <100.00%> (+0.39%) :arrow_up:
...nal/xdsclient/xdsresource/cluster_resource_type.go 77.14% <100.00%> (ø)
...l/xdsclient/xdsresource/endpoints_resource_type.go 75.00% <100.00%> (ø)
... and 8 more

... and 41 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jan 06 '25 12:01 codecov[bot]

@easwars ptal

purnesh42H avatar Jan 31 '25 11:01 purnesh42H

This came up during some other discussion with @dfawley.

  1. Callback methods named OnXxx are more of a C++ style. In Go, we would probably just name them Xxx.
  2. Whether we should have two callback methods, one named ResourceChanged(ResourceData), and the other named ResourceError(error). This is to work around the fact that we don't have a StatusOr type in Go, and without that, having two APIs is better than one API (where we need to add a new struct to hold one of two possible values, with no guarantees that only one of the two values is set).

easwars avatar Feb 06 '25 19:02 easwars

This came up during some other discussion with @dfawley.

  1. Callback methods named OnXxx are more of a C++ style. In Go, we would probably just name them Xxx.
  2. Whether we should have two callback methods, one named ResourceChanged(ResourceData), and the other named ResourceError(error). This is to work around the fact that we don't have a StatusOr type in Go, and without that, having two APIs is better than one API (where we need to add a new struct to hold one of two possible values, with no guarantees that only one of the two values is set).

I will wait for https://github.com/grpc/grpc-go/pull/8042/ before making further changes to this PR since that has the discussion on decision

purnesh42H avatar Feb 18 '25 06:02 purnesh42H

@easwars I have made the changes to match the final a88 watcher interface ResourceChanged, ResourceError and AmbientError including the conditions to call ResourceError or AmbientError based on cache.

Everything else is working fine but looks like this TestServeAndCloseDoNotRace is causing race https://github.com/grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test. I can't seem to figure how the watcher change is affecting this. Could you help take a look? Thanks.

purnesh42H avatar Mar 26 '25 20:03 purnesh42H

@easwars I have made the changes to match the final a88 watcher interface ResourceChanged, ResourceError and AmbientError including the conditions to call ResourceError or AmbientError based on cache.

Everything else is working fine but looks like this TestServeAndCloseDoNotRace is causing race https://github.com/grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test. I can't seem to figure how the watcher change is affecting this. Could you help take a look? Thanks.

Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?

purnesh42H avatar Mar 26 '25 20:03 purnesh42H

@easwars I have made the changes to match the final a88 watcher interface ResourceChanged, ResourceError and AmbientError including the conditions to call ResourceError or AmbientError based on cache. Everything else is working fine but looks like this TestServeAndCloseDoNotRace is causing race https://github.com/grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test. I can't seem to figure how the watcher change is affecting this. Could you help take a look? Thanks.

Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?

Actually, I thought over it and right thing to do here is handle this case in listener wrapper. Like previously done, we handle lds error only in case of resource-not-found. Let me know if that's okay.

purnesh42H avatar Mar 27 '25 04:03 purnesh42H

Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?

If we start a watch for a resource and the authority of that resource is not found in the bootstrap config, then the right behavior is to report a non-ambient error to the watcher.

In general, we should never call OnAmbientError() for a watcher before we call OnResourceChanged(). If we get a transient error before we've seen a valid resource, then the transient error should be reported to OnResourceChanged() instead of OnAmbientError().

markdroth avatar Mar 27 '25 15:03 markdroth

Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?

If we start a watch for a resource and the authority of that resource is not found in the bootstrap config, then the right behavior is to report a non-ambient error to the watcher.

In general, we should never call OnAmbientError() for a watcher before we call OnResourceChanged(). If we get a transient error before we've seen a valid resource, then the transient error should be reported to OnResourceChanged() instead of OnAmbientError().

yeah handled it in ListenerWrapper. Still calling ResourceError for channel not found

purnesh42H avatar Mar 27 '25 18:03 purnesh42H

yeah handled it in ListenerWrapper

I haven't looked at the code, but that seems wrong to me. There should not need to be any special handling for individual resource types -- the logic for this in XdsClient should be completely resource-type-agnostic.

The only thing that should be handled in individual resource types are things that are different from other resource types, like determining what resource type to pass to the watcher.

markdroth avatar Mar 27 '25 18:03 markdroth

yeah handled it in ListenerWrapper

I haven't looked at the code, but that seems wrong to me. There should not need to be any special handling for individual resource types -- the logic for this in XdsClient should be completely resource-type-agnostic.

The only thing that should be handled in individual resource types are things that are different from other resource types, like determining what resource type to pass to the watcher.

This error for channel not found while registering watch https://github.com/grpc/grpc-go/pull/7977/files#diff-b4a775482793516a8ed23bd6a58a805d316f5abb6468580691c2b0556e22606aR619 was recently added. Listener wrapper for grpc xds server had a logic to stop serving only if error is of type ResourceNotFound so i just kept that instead of stopping always https://github.com/grpc/grpc-go/pull/7977/files#diff-e4706c72ae912399b7f8ee6f04cec2374ef7a7679b12358f201ddb0b45e34146R453-R457. If we don't do that TestServeAndCloseDoNotRace is causing the race because we are creating and stopping grpc xds servers in a loop . @easwars do you think we should rewrite this test?

purnesh42H avatar Mar 27 '25 19:03 purnesh42H

Listener wrapper for grpc xds server had a logic to stop serving only if error is of type ResourceNotFound so i just kept that instead of stopping always https://github.com/grpc/grpc-go/pull/7977/files#diff-e4706c72ae912399b7f8ee6f04cec2374ef7a7679b12358f201ddb0b45e34146R453-R457.

I don't think that check is correct -- it should be removed. If the XdsClient delivers an error, it should always be passed through to the watcher.

That's an inherent part of the API change we're making in this PR. The errors that we were previously ignoring in OnError() are now being passed to OnAmbientError() instead. If you are seeing an error reported to OnError(), that means we actually want to drop the resource we saw previously, if any.

markdroth avatar Mar 27 '25 20:03 markdroth

Yes, I agree that we should not have checks for "resource-not-found" errors in the individual watchers. In fact, I'm thinking we might not even need those error types like xdsresource.ErrorTypeResourceNotFound anymore. If ResourceError is invoked, then the watcher must stop using the previous configuration and if AmbientError is invoked, it should continue using the previous configuration and the xDS client should guarantee that AmbientError is only invoked when a previous copy of the resource was delivered via ResourceChanged.

easwars avatar Mar 27 '25 21:03 easwars

Everything else is working fine but looks like this TestServeAndCloseDoNotRace is causing race grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test.

Sorry, I don't have the cycles to debug this failure. But if you can dig a little deeper and get more info on the sequence of events leading to the failure, and why exactly the flake is happening, I might be able to help better. Once we have that information, we will be able to decide if we need to rewrite the test or if there is something wrong we are doing in the xDS client. Thanks.

easwars avatar Mar 27 '25 21:03 easwars

Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?

If we start a watch for a resource and the authority of that resource is not found in the bootstrap config, then the right behavior is to report a non-ambient error to the watcher.

In general, we should never call OnAmbientError() for a watcher before we call OnResourceChanged(). If we get a transient error before we've seen a valid resource, then the transient error should be reported to OnResourceChanged() instead of OnAmbientError().

@markdroth the case we are looking here is actually channel creation failure. In c++ and java that's never expected but in go recently a change was added to report a transient error if channel creation failed. Now, based on A88, this should be a non-ambient error only if we don't have a cached resource otherwise it should be an ambient error. Is that the correct behavior?

purnesh42H avatar Apr 04 '25 15:04 purnesh42H

@markdroth the case we are looking here is actually channel creation failure. In c++ and java that's never expected but in go recently a change was added to report a transient error if channel creation failed. Now, based on A88, this should be a non-ambient error only if we don't have a cached resource otherwise it should be an ambient error. Is that the correct behavior?

Yes. But in practice I suspect this case will never happen when we have a cached resource, because the only cases where channel creation fails should be things like invalid target URIs, in which case the channel creation will never succeed for the life of the XdsClient, since the bootstrap config (which specifies the target URI) can't change after the XdsClient is created. And if the channel creation never succeeds, then there's no way we can have gotten a cached resource.

The only exception I can think of to that would be fallback, if channel creation is failing for the primary server but not for the secondary server. But in general, channel-level errors on the primary server should not be sent to the watcher in the first place if there's a secondary server to fall back to, so I think that's a non-issue.

markdroth avatar Apr 04 '25 15:04 markdroth

@easwars the expected behavior, according to A88 is if resource is NACKed and is not cached, the watcher should drop the resource. So, in the case of xDS enabled gRPC server, if listener resource is not cached, server needs to drop the listener resource and should move to "not serving". Another thing to note is we don't suppress not_serving->not_serving transition and invoke the mode change callback each time.

So, i have modified the server tests cases accordingly,

  • TestServer_Security_NoCertificateProvidersInBootstrap_FailureTest
  • Server_Security_WithValidAndInvalidSecurityConfiguration In these, we need to drain the modeCh and errorCh channel of test mode change callback each time serving mode is switched but also make sure that it is always moving to "not serving".

For, TestResolverBadServiceUpdate, I have split into 2 parts. 1) without cached 2) with cached

For, TestServeAndCloseDoNotRace, override the default ServingModeCallback because the listener resource is with invalid uri which is always expected to cause resource error leading to serving mode change. And default ServingModeCallback logs an error in that case. For this test that's irrelevant.

With this I think we have verified how the existing functionality is affected. I will also import this PR in google3 and make sure all the prod resolvers including directpath are working as expected.

purnesh42H avatar Apr 07 '25 07:04 purnesh42H

I don't understand what you mean by dropping the resource? Why should the listener watcher drop the resource if it sees the resource being NACKed when it was not cached previously? It should continue watching that resource in the hopes that a subsequent update for the same resource will be ACKed.

easwars avatar Apr 07 '25 17:04 easwars

I don't understand what you mean by dropping the resource? Why should the listener watcher drop the resource if it sees the resource being NACKed when it was not cached previously? It should continue watching that resource in the hopes that a subsequent update for the same resource will be ACKed.

oh by dropping the resource it doesn't mean stop watching the resource. It just means that if the consumer of watcher was using the resource, they should stop using it. So, in the case of server it leads to serving mode change to "not serving" if listener resource is nacked but watcher will still have the watch on for the listener resource.

purnesh42H avatar Apr 07 '25 17:04 purnesh42H

the expected behavior, according to A88 is if resource is NACKed and is not cached, the watcher should drop the resource

If we are talking about the case where a NACK is seen without the resource being cached, then the question of not using the resource anymore does not come into the picture, because the resource does not exist yet.

easwars avatar Apr 07 '25 17:04 easwars

Yes, I agree that we should not have checks for "resource-not-found" errors in the individual watchers. In fact, I'm thinking we might not even need those error types like xdsresource.ErrorTypeResourceNotFound anymore. If ResourceError is invoked, then the watcher must stop using the previous configuration and if AmbientError is invoked, it should continue using the previous configuration and the xDS client should guarantee that AmbientError is only invoked when a previous copy of the resource was delivered via ResourceChanged.

While making another pass through this, I really think we should get rid of these error types, or at least look into getting rid of them. The reason behind my proposal is that when a watcher gets a ResourceError, it should simply stop using that resource if it already had it. There should be no logic of checking the error type and doing different things. Having these error types encourages that behavior.

We don't have to make that change as part of this PR, but I think you should add it to your list of things that you have that need to be done as part of the generic xDS client changes.

easwars avatar Apr 07 '25 17:04 easwars

Yes, I agree that we should not have checks for "resource-not-found" errors in the individual watchers. In fact, I'm thinking we might not even need those error types like xdsresource.ErrorTypeResourceNotFound anymore. If ResourceError is invoked, then the watcher must stop using the previous configuration and if AmbientError is invoked, it should continue using the previous configuration and the xDS client should guarantee that AmbientError is only invoked when a previous copy of the resource was delivered via ResourceChanged.

While making another pass through this, I really think we should get rid of these error types, or at least look into getting rid of them. The reason behind my proposal is that when a watcher gets a ResourceError, it should simply stop using that resource if it already had it. There should be no logic of checking the error type and doing different things. Having these error types encourages that behavior.

We don't have to make that change as part of this PR, but I think you should add it to your list of things that you have that need to be done as part of the generic xDS client changes.

yes, I have added. Thanks.

purnesh42H avatar Apr 07 '25 17:04 purnesh42H

the expected behavior, according to A88 is if resource is NACKed and is not cached, the watcher should drop the resource

If we are talking about the case where a NACK is seen without the resource being cached, then the question of not using the resource anymore does not come into the picture, because the resource does not exist yet.

yeah if watcher has not seen the valid resource before then its not applicable. The gRFC talks about case when watcher has seen the valid resource but it deleted in xDS client cache https://github.com/grpc/proposal/blob/master/A88-xds-data-error-handling.md#changes-to-xdsclient-watcher-apis.

purnesh42H avatar Apr 07 '25 17:04 purnesh42H