grpc-go
grpc-go copied to clipboard
Cluster resolver should produce better errors when there are no healthy endpoints
EDIT: See https://github.com/grpc/grpc-go/issues/7749#issuecomment-2452692541 -- (1) below is not correct. (2) should still be fixed.
If there are no healthy addresses, clutserresolver is just passing an empty address list to the child LB policy:
https://github.com/grpc/grpc-go/blob/4544b8a4cfe9bb4882ec3591631b83dac7434805/xds/internal/balancer/clusterresolver/configbuilder.go#L258-L290
This means two things:
- The child policy will just ignore this update if it previously had healthy addresses, and
- The error that comes out, if the child policy never had valid addresses, says something about a resolver error, but it should say something like "cluster has no healthy endpoints".
Both of these should be considered bugs.
cc @markdroth @ejona86
The child policy will just ignore this update if it previously had healthy addresses
Yeah, that definitely seems like a bug in the child policy, even completely independent of xDS. I think the child policy should report TF and its Update() method should return non-OK, thus triggering the resolver to re-resolve.
The error that comes out, if the child policy never had valid addresses, says something about a resolver error, but it should say something like "cluster has no healthy endpoints".
Yeah, the more useful we can make the error message, the better. I did a bunch of work for cases like this in C-core as part of https://github.com/grpc/grpc/issues/22883. The way we're handling this is:
- The resolver API allows the resolver to return a
resolution_notestring that provides human-readable context that can be used in cases where it returns a valid address list and service config that the LB policy may not like in combination (e.g., if the address list is empty). - For EDS, we set the
resolution_noteto indicate when there's a problem. For example, we set if when the EDS resource contains no localities or when we get an error for the EDS resource. - All of our parent policies propagate this field through updates to children as appropriate. (Prior to implementing gRFC A75, there was a case where we needed to concatenate
resolution_notestrings from different clusters for aggregate clusters, but that's no longer necessary after the changes from that design.) - In both pick_first and our petiole policies (e.g., round_robin), if we get a valid but empty address list, we include the
resolution_notein the error message we return.
Yeah, that definitely seems like a bug in the child policy, even completely independent of xDS. I think the child policy should report TF and its Update() method should return non-OK, thus triggering the resolver to re-resolve.
PF will return a non-OK result and that will trigger a re-resolution, but the point here is that the policy will keep using the old addresses, which is the defined behavior of PF.
The way we're handling this is:
I was expecting this needs to be handled by some custom logic in clusterresolver that detects when there are no addresses and substitutes the child picker for another one with a "no endpoints" error.
If you're pushing the EDS resolver update to the child policy, then how do you avoid the same situation where we keep using old addresses when zero address errors are encountered?
I think the child policy should report TF
Oops, I guess I misread this part the first time. I don't think the child policy is supposed to report TF in these cases, right? IIUC, PF is supposed to treat a resolver result of 0 addresses the same as a resolver error, which we ignore if we were already READY.
If you remember back when we agreed on the "always trust the control plane" principle (internally, see go/grpc-client-channel-principles-revamp), we decided that if the control plane sends us zero addresses, then we need to honor that immediately. Our currently agreed behavior of record is that PF should report TF in this case. It should not ignore the update if it was already READY -- we do that only if we can an error (as opposed to a valid but empty address list).
As you know, I'm not a big fan of the "always trust the control plane" principle, and I would still like us to reevaluate whether that's the right thing, especially now that we have more o11y infrastructure in place in OSS. But until we do that, the above is the behavior that we should be providing in all languages.
The child policy will just ignore this update if it previously had healthy addresses, and
clusterresolver's child policy is priority, and if priority receives an update with no addresses, it should error saying no priority is provided, all priorities are removed.
Is that not what are you are seeing?
Where did you run into this? And what exact error were you seeing?
I did check our codebase, and looks like we do not have an e2e style test for a scenario where the clusterresolver receives endpoints, all of which are unhealthy. We should add a test for it, for sure.
Where did you run into this? And what exact error were you seeing?
It's a theoretical case that I quickly modified the existing clusterresolver e2e test of TestEDS_EndpointsHealth to stimulate. The error it gives when you set all addresses to UNHEALTHY is rpc error: code = Unavailable desc = last resolver error: produced zero addresses, which is what I mention in (2) and implies (1) is likely true, but it's technially possible something else happens.
Also, an update: we may actually want to change the expected behavior of some of these things. I'll follow up here when I know move.
It looks like my understanding of our PF implementation was wrong, and that (1) is false. We will accept a zero-address update from the resolver and apply it, making future RPCs fail. This was implemented in #5274. Sorry for the mistake.
We still need to improve our error messages when a cluster has no routable endpoints, however.