xdsclient: update watcher API as per gRFC A88
This is the first part of implementing gRFC A88 (https://github.com/grpc/proposal/pull/466).
This introduces the new watcher API but does not change any of the existing behavior. This table summarizes the API changes and behavior for each case:
| Case | Old API | New API | Behavior |
|---|---|---|---|
| Resource timer fires | OnResourceDoesNotExist() |
OnResourceChanged(NOT_FOUND) |
Fail data plane RPCs |
| LDS or CDS resource deletion | OnResourceDoesNotExist() |
OnResourceChanged(NOT_FOUND) |
Drop resource and fail data plane RPCs |
| xDS channel reports TRANSIENT_FAILURE | OnError() |
OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise |
Continue using cached resource, if any; otherwise, fail data plane RPCs |
| ADS stream terminates without receiving a response | OnError() |
OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise |
Continue using cached resource, if any; otherwise, fail data plane RPCs |
| Invalid resource update (client NACK) | OnError() |
OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise |
Continue using cached resource, if any; otherwise, fail data plane RPCs |
| Valid resource update | OnUpdate(resource) |
OnResourceChanged(resource) |
use the new resource |
RELEASE NOTES:
- xds: TBD
Codecov Report
Attention: Patch coverage is 80.20305% with 39 lines in your changes missing coverage. Please review.
Project coverage is 82.21%. Comparing base (
ce35fd4) to head (c43cd62). Report is 9 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #7977 +/- ##
==========================================
+ Coverage 82.16% 82.21% +0.05%
==========================================
Files 410 412 +2
Lines 40248 40487 +239
==========================================
+ Hits 33068 33287 +219
- Misses 5830 5839 +9
- Partials 1350 1361 +11
| Files with missing lines | Coverage Δ | |
|---|---|---|
| internal/xds/bootstrap/bootstrap.go | 65.72% <ø> (ø) |
|
| xds/internal/balancer/cdsbalancer/cdsbalancer.go | 84.49% <100.00%> (-0.15%) |
:arrow_down: |
| ...s/internal/balancer/cdsbalancer/cluster_watcher.go | 100.00% <100.00%> (ø) |
|
| ...rnal/balancer/clusterresolver/resource_resolver.go | 94.44% <100.00%> (ø) |
|
| xds/internal/resolver/serviceconfig.go | 85.71% <100.00%> (ø) |
|
| xds/internal/resolver/watch_service.go | 100.00% <100.00%> (+8.57%) |
:arrow_up: |
| xds/internal/xdsclient/authority.go | 80.38% <100.00%> (+0.60%) |
:arrow_up: |
| xds/internal/xdsclient/clientimpl_watchers.go | 84.81% <100.00%> (+0.39%) |
:arrow_up: |
| ...nal/xdsclient/xdsresource/cluster_resource_type.go | 77.14% <100.00%> (ø) |
|
| ...l/xdsclient/xdsresource/endpoints_resource_type.go | 75.00% <100.00%> (ø) |
|
| ... and 8 more |
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
@easwars ptal
This came up during some other discussion with @dfawley.
- Callback methods named
OnXxxare more of a C++ style. In Go, we would probably just name themXxx. - Whether we should have two callback methods, one named
ResourceChanged(ResourceData), and the other namedResourceError(error). This is to work around the fact that we don't have aStatusOrtype in Go, and without that, having two APIs is better than one API (where we need to add a new struct to hold one of two possible values, with no guarantees that only one of the two values is set).
This came up during some other discussion with @dfawley.
- Callback methods named
OnXxxare more of a C++ style. In Go, we would probably just name themXxx.- Whether we should have two callback methods, one named
ResourceChanged(ResourceData), and the other namedResourceError(error). This is to work around the fact that we don't have aStatusOrtype in Go, and without that, having two APIs is better than one API (where we need to add a new struct to hold one of two possible values, with no guarantees that only one of the two values is set).
I will wait for https://github.com/grpc/grpc-go/pull/8042/ before making further changes to this PR since that has the discussion on decision
@easwars I have made the changes to match the final a88 watcher interface ResourceChanged, ResourceError and AmbientError including the conditions to call ResourceError or AmbientError based on cache.
Everything else is working fine but looks like this TestServeAndCloseDoNotRace is causing race https://github.com/grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test. I can't seem to figure how the watcher change is affecting this. Could you help take a look? Thanks.
@easwars I have made the changes to match the final a88 watcher interface
ResourceChanged,ResourceErrorandAmbientErrorincluding the conditions to callResourceErrororAmbientErrorbased on cache.Everything else is working fine but looks like this
TestServeAndCloseDoNotRaceis causing race https://github.com/grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test. I can't seem to figure how the watcher change is affecting this. Could you help take a look? Thanks.
Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?
@easwars I have made the changes to match the final a88 watcher interface
ResourceChanged,ResourceErrorandAmbientErrorincluding the conditions to callResourceErrororAmbientErrorbased on cache. Everything else is working fine but looks like thisTestServeAndCloseDoNotRaceis causing race https://github.com/grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test. I can't seem to figure how the watcher change is affecting this. Could you help take a look? Thanks.Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?
Actually, I thought over it and right thing to do here is handle this case in listener wrapper. Like previously done, we handle lds error only in case of resource-not-found. Let me know if that's okay.
Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?
If we start a watch for a resource and the authority of that resource is not found in the bootstrap config, then the right behavior is to report a non-ambient error to the watcher.
In general, we should never call OnAmbientError() for a watcher before we call OnResourceChanged(). If we get a transient error before we've seen a valid resource, then the transient error should be reported to OnResourceChanged() instead of OnAmbientError().
Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?
If we start a watch for a resource and the authority of that resource is not found in the bootstrap config, then the right behavior is to report a non-ambient error to the watcher.
In general, we should never call OnAmbientError() for a watcher before we call OnResourceChanged(). If we get a transient error before we've seen a valid resource, then the transient error should be reported to OnResourceChanged() instead of OnAmbientError().
yeah handled it in ListenerWrapper. Still calling ResourceError for channel not found
yeah handled it in ListenerWrapper
I haven't looked at the code, but that seems wrong to me. There should not need to be any special handling for individual resource types -- the logic for this in XdsClient should be completely resource-type-agnostic.
The only thing that should be handled in individual resource types are things that are different from other resource types, like determining what resource type to pass to the watcher.
yeah handled it in ListenerWrapper
I haven't looked at the code, but that seems wrong to me. There should not need to be any special handling for individual resource types -- the logic for this in XdsClient should be completely resource-type-agnostic.
The only thing that should be handled in individual resource types are things that are different from other resource types, like determining what resource type to pass to the watcher.
This error for channel not found while registering watch https://github.com/grpc/grpc-go/pull/7977/files#diff-b4a775482793516a8ed23bd6a58a805d316f5abb6468580691c2b0556e22606aR619 was recently added. Listener wrapper for grpc xds server had a logic to stop serving only if error is of type ResourceNotFound so i just kept that instead of stopping always https://github.com/grpc/grpc-go/pull/7977/files#diff-e4706c72ae912399b7f8ee6f04cec2374ef7a7679b12358f201ddb0b45e34146R453-R457. If we don't do that TestServeAndCloseDoNotRace is causing the race because we are creating and stopping grpc xds servers in a loop . @easwars do you think we should rewrite this test?
Listener wrapper for grpc xds server had a logic to stop serving only if error is of type ResourceNotFound so i just kept that instead of stopping always https://github.com/grpc/grpc-go/pull/7977/files#diff-e4706c72ae912399b7f8ee6f04cec2374ef7a7679b12358f201ddb0b45e34146R453-R457.
I don't think that check is correct -- it should be removed. If the XdsClient delivers an error, it should always be passed through to the watcher.
That's an inherent part of the API change we're making in this PR. The errors that we were previously ignoring in OnError() are now being passed to OnAmbientError() instead. If you are seeing an error reported to OnError(), that means we actually want to drop the resource we saw previously, if any.
Yes, I agree that we should not have checks for "resource-not-found" errors in the individual watchers. In fact, I'm thinking we might not even need those error types like xdsresource.ErrorTypeResourceNotFound anymore. If ResourceError is invoked, then the watcher must stop using the previous configuration and if AmbientError is invoked, it should continue using the previous configuration and the xDS client should guarantee that AmbientError is only invoked when a previous copy of the resource was delivered via ResourceChanged.
Everything else is working fine but looks like this TestServeAndCloseDoNotRace is causing race grpc/grpc-go/actions/runs/14091957805/job/39470559995?pr=7977. Server is stopping in the middle of the test.
Sorry, I don't have the cycles to debug this failure. But if you can dig a little deeper and get more info on the sequence of events leading to the failure, and why exactly the flake is happening, I might be able to help better. Once we have that information, we will be able to decide if we need to rewrite the test or if there is something wrong we are doing in the xDS client. Thanks.
Ah, look like it was failing because of calling ResourceError when channel not found in authority.go which was calling ListenerWrapper onLDSResourceError which was switching serving mode. Seems like that's unexpected behavior? Calling AmbientError fixed it. Is it correct to consider channel not found to be AmbientError while registering the watcher?
If we start a watch for a resource and the authority of that resource is not found in the bootstrap config, then the right behavior is to report a non-ambient error to the watcher.
In general, we should never call OnAmbientError() for a watcher before we call OnResourceChanged(). If we get a transient error before we've seen a valid resource, then the transient error should be reported to OnResourceChanged() instead of OnAmbientError().
@markdroth the case we are looking here is actually channel creation failure. In c++ and java that's never expected but in go recently a change was added to report a transient error if channel creation failed. Now, based on A88, this should be a non-ambient error only if we don't have a cached resource otherwise it should be an ambient error. Is that the correct behavior?
@markdroth the case we are looking here is actually channel creation failure. In c++ and java that's never expected but in go recently a change was added to report a transient error if channel creation failed. Now, based on A88, this should be a non-ambient error only if we don't have a cached resource otherwise it should be an ambient error. Is that the correct behavior?
Yes. But in practice I suspect this case will never happen when we have a cached resource, because the only cases where channel creation fails should be things like invalid target URIs, in which case the channel creation will never succeed for the life of the XdsClient, since the bootstrap config (which specifies the target URI) can't change after the XdsClient is created. And if the channel creation never succeeds, then there's no way we can have gotten a cached resource.
The only exception I can think of to that would be fallback, if channel creation is failing for the primary server but not for the secondary server. But in general, channel-level errors on the primary server should not be sent to the watcher in the first place if there's a secondary server to fall back to, so I think that's a non-issue.
@easwars the expected behavior, according to A88 is if resource is NACKed and is not cached, the watcher should drop the resource. So, in the case of xDS enabled gRPC server, if listener resource is not cached, server needs to drop the listener resource and should move to "not serving". Another thing to note is we don't suppress not_serving->not_serving transition and invoke the mode change callback each time.
So, i have modified the server tests cases accordingly,
- TestServer_Security_NoCertificateProvidersInBootstrap_FailureTest
- Server_Security_WithValidAndInvalidSecurityConfiguration In these, we need to drain the modeCh and errorCh channel of test mode change callback each time serving mode is switched but also make sure that it is always moving to "not serving".
For, TestResolverBadServiceUpdate, I have split into 2 parts. 1) without cached 2) with cached
For, TestServeAndCloseDoNotRace, override the default ServingModeCallback because the listener resource is with invalid uri which is always expected to cause resource error leading to serving mode change. And default ServingModeCallback logs an error in that case. For this test that's irrelevant.
With this I think we have verified how the existing functionality is affected. I will also import this PR in google3 and make sure all the prod resolvers including directpath are working as expected.
I don't understand what you mean by dropping the resource? Why should the listener watcher drop the resource if it sees the resource being NACKed when it was not cached previously? It should continue watching that resource in the hopes that a subsequent update for the same resource will be ACKed.
I don't understand what you mean by dropping the resource? Why should the listener watcher drop the resource if it sees the resource being NACKed when it was not cached previously? It should continue watching that resource in the hopes that a subsequent update for the same resource will be ACKed.
oh by dropping the resource it doesn't mean stop watching the resource. It just means that if the consumer of watcher was using the resource, they should stop using it. So, in the case of server it leads to serving mode change to "not serving" if listener resource is nacked but watcher will still have the watch on for the listener resource.
the expected behavior, according to A88 is if resource is NACKed and is not cached, the watcher should drop the resource
If we are talking about the case where a NACK is seen without the resource being cached, then the question of not using the resource anymore does not come into the picture, because the resource does not exist yet.
Yes, I agree that we should not have checks for "resource-not-found" errors in the individual watchers. In fact, I'm thinking we might not even need those error types like
xdsresource.ErrorTypeResourceNotFoundanymore. IfResourceErroris invoked, then the watcher must stop using the previous configuration and ifAmbientErroris invoked, it should continue using the previous configuration and the xDS client should guarantee thatAmbientErroris only invoked when a previous copy of the resource was delivered viaResourceChanged.
While making another pass through this, I really think we should get rid of these error types, or at least look into getting rid of them. The reason behind my proposal is that when a watcher gets a ResourceError, it should simply stop using that resource if it already had it. There should be no logic of checking the error type and doing different things. Having these error types encourages that behavior.
We don't have to make that change as part of this PR, but I think you should add it to your list of things that you have that need to be done as part of the generic xDS client changes.
Yes, I agree that we should not have checks for "resource-not-found" errors in the individual watchers. In fact, I'm thinking we might not even need those error types like
xdsresource.ErrorTypeResourceNotFoundanymore. IfResourceErroris invoked, then the watcher must stop using the previous configuration and ifAmbientErroris invoked, it should continue using the previous configuration and the xDS client should guarantee thatAmbientErroris only invoked when a previous copy of the resource was delivered viaResourceChanged.While making another pass through this, I really think we should get rid of these error types, or at least look into getting rid of them. The reason behind my proposal is that when a watcher gets a
ResourceError, it should simply stop using that resource if it already had it. There should be no logic of checking the error type and doing different things. Having these error types encourages that behavior.We don't have to make that change as part of this PR, but I think you should add it to your list of things that you have that need to be done as part of the generic xDS client changes.
yes, I have added. Thanks.
the expected behavior, according to A88 is if resource is NACKed and is not cached, the watcher should drop the resource
If we are talking about the case where a NACK is seen without the resource being cached, then the question of not using the resource anymore does not come into the picture, because the resource does not exist yet.
yeah if watcher has not seen the valid resource before then its not applicable. The gRFC talks about case when watcher has seen the valid resource but it deleted in xDS client cache https://github.com/grpc/proposal/blob/master/A88-xds-data-error-handling.md#changes-to-xdsclient-watcher-apis.