grpc-go icon indicating copy to clipboard operation
grpc-go copied to clipboard

Cluster resolver should produce better errors when there are no healthy endpoints

Open dfawley opened this issue 1 year ago • 9 comments

EDIT: See https://github.com/grpc/grpc-go/issues/7749#issuecomment-2452692541 -- (1) below is not correct. (2) should still be fixed.

If there are no healthy addresses, clutserresolver is just passing an empty address list to the child LB policy:

https://github.com/grpc/grpc-go/blob/4544b8a4cfe9bb4882ec3591631b83dac7434805/xds/internal/balancer/clusterresolver/configbuilder.go#L258-L290

This means two things:

  1. The child policy will just ignore this update if it previously had healthy addresses, and
  2. The error that comes out, if the child policy never had valid addresses, says something about a resolver error, but it should say something like "cluster has no healthy endpoints".

Both of these should be considered bugs.

cc @markdroth @ejona86

dfawley avatar Oct 16 '24 18:10 dfawley

The child policy will just ignore this update if it previously had healthy addresses

Yeah, that definitely seems like a bug in the child policy, even completely independent of xDS. I think the child policy should report TF and its Update() method should return non-OK, thus triggering the resolver to re-resolve.

The error that comes out, if the child policy never had valid addresses, says something about a resolver error, but it should say something like "cluster has no healthy endpoints".

Yeah, the more useful we can make the error message, the better. I did a bunch of work for cases like this in C-core as part of https://github.com/grpc/grpc/issues/22883. The way we're handling this is:

  • The resolver API allows the resolver to return a resolution_note string that provides human-readable context that can be used in cases where it returns a valid address list and service config that the LB policy may not like in combination (e.g., if the address list is empty).
  • For EDS, we set the resolution_note to indicate when there's a problem. For example, we set if when the EDS resource contains no localities or when we get an error for the EDS resource.
  • All of our parent policies propagate this field through updates to children as appropriate. (Prior to implementing gRFC A75, there was a case where we needed to concatenate resolution_note strings from different clusters for aggregate clusters, but that's no longer necessary after the changes from that design.)
  • In both pick_first and our petiole policies (e.g., round_robin), if we get a valid but empty address list, we include the resolution_note in the error message we return.

markdroth avatar Oct 16 '24 19:10 markdroth

Yeah, that definitely seems like a bug in the child policy, even completely independent of xDS. I think the child policy should report TF and its Update() method should return non-OK, thus triggering the resolver to re-resolve.

PF will return a non-OK result and that will trigger a re-resolution, but the point here is that the policy will keep using the old addresses, which is the defined behavior of PF.

dfawley avatar Oct 16 '24 19:10 dfawley

The way we're handling this is:

I was expecting this needs to be handled by some custom logic in clusterresolver that detects when there are no addresses and substitutes the child picker for another one with a "no endpoints" error.

If you're pushing the EDS resolver update to the child policy, then how do you avoid the same situation where we keep using old addresses when zero address errors are encountered?

dfawley avatar Oct 16 '24 19:10 dfawley

I think the child policy should report TF

Oops, I guess I misread this part the first time. I don't think the child policy is supposed to report TF in these cases, right? IIUC, PF is supposed to treat a resolver result of 0 addresses the same as a resolver error, which we ignore if we were already READY.

dfawley avatar Oct 16 '24 19:10 dfawley

If you remember back when we agreed on the "always trust the control plane" principle (internally, see go/grpc-client-channel-principles-revamp), we decided that if the control plane sends us zero addresses, then we need to honor that immediately. Our currently agreed behavior of record is that PF should report TF in this case. It should not ignore the update if it was already READY -- we do that only if we can an error (as opposed to a valid but empty address list).

As you know, I'm not a big fan of the "always trust the control plane" principle, and I would still like us to reevaluate whether that's the right thing, especially now that we have more o11y infrastructure in place in OSS. But until we do that, the above is the behavior that we should be providing in all languages.

markdroth avatar Oct 16 '24 21:10 markdroth

The child policy will just ignore this update if it previously had healthy addresses, and

clusterresolver's child policy is priority, and if priority receives an update with no addresses, it should error saying no priority is provided, all priorities are removed.

Is that not what are you are seeing?

Where did you run into this? And what exact error were you seeing?

I did check our codebase, and looks like we do not have an e2e style test for a scenario where the clusterresolver receives endpoints, all of which are unhealthy. We should add a test for it, for sure.

easwars avatar Oct 21 '24 16:10 easwars

Where did you run into this? And what exact error were you seeing?

It's a theoretical case that I quickly modified the existing clusterresolver e2e test of TestEDS_EndpointsHealth to stimulate. The error it gives when you set all addresses to UNHEALTHY is rpc error: code = Unavailable desc = last resolver error: produced zero addresses, which is what I mention in (2) and implies (1) is likely true, but it's technially possible something else happens.

dfawley avatar Oct 22 '24 15:10 dfawley

Also, an update: we may actually want to change the expected behavior of some of these things. I'll follow up here when I know move.

dfawley avatar Oct 22 '24 15:10 dfawley

It looks like my understanding of our PF implementation was wrong, and that (1) is false. We will accept a zero-address update from the resolver and apply it, making future RPCs fail. This was implemented in #5274. Sorry for the mistake.

We still need to improve our error messages when a cluster has no routable endpoints, however.

dfawley avatar Nov 01 '24 22:11 dfawley