Explicit error message when failing to create the health check stream due to failing per-RPC credentials calls
Use case(s) - what problem will this feature solve?
When health check is enabled (via "healthCheckConfig": {"serviceName": ""} in the service config) and the client fails to create a stream, the channel remains in CONNECTING state and no error message is output. This makes the task of troubleshooting problems caused by creating the health check watch stream hard, as there are neither logs nor RPC errors, and RPC typically fail with deadline exceeded (if a deadline is set).
The code that swallows errors when creating the health check Watch stream is here: https://github.com/grpc/grpc-go/blob/fbff2abb0f5d6857ef3dca10958ffc948c207bc0/health/client.go#L74-L78. Errors when creating a stream can come from:
- The transport is not ready or nil (https://github.com/grpc/grpc-go/blob/38a8b9a7057218c210a599d79f3e1d07f84d18c5/balancer_wrapper.go#L362-L365 and https://github.com/grpc/grpc-go/blob/e912015fd3f4aabdff6d6cf835e321c19a204afb/stream.go#L1233-L1236) IIUC it doesn't apply in this case, since the health check watch is only started when the transport is ready
- A CallOption returns an error (I don't think this can happen here, since we don't provide the ability to customize call options for health checks) https://github.com/grpc/grpc-go/blob/e912015fd3f4aabdff6d6cf835e321c19a204afb/stream.go#L1253-L1255
- We fail to get codec or compressor for the call (again not applicable to health check since gRPC controls that part)
- We fail to get a transport or create a stream on the transport (https://github.com/grpc/grpc-go/blob/e912015fd3f4aabdff6d6cf835e321c19a204afb/stream.go#L353-L358). This can happen when we fail to create header fields, e.g. the call to
GetRequestMetadatato get per-RPC credentials fails (that is the case I ran into).
Proposed Solution
Transition the subchannel to TRANSIENT_FAILURE if we failed to create the watch stream. Continue the retry loop.
Alternatives Considered
If changing the behavior of of health checks when per RPC credentials fail is not desirable, at least logging through channelz would be nice.
I provided an illustration for the problem in https://github.com/grpc/grpc-go/commit/e10aa04718af94163851fe0fe04878f907e10689.
I don't see any problems with the one line fix in https://github.com/grpc/grpc-go/commit/e10aa04718af94163851fe0fe04878f907e10689 that updates the channel state to TRANSIENT_FAILURE if steam creation fails. The error has been ignored since the code was first introduced in https://github.com/grpc/grpc-go/pull/2389 and there are no comments indicating why this is required. This may have been an oversight. All the tests pass with the fix.
As an optimization, we could use a boolean to store if we've already updated the subchannel state to TF and avoid causing multiple picker updates after each attempt.
@atollena will you be open to raising a PR with the proposed change?
I don't see any problems with the one line fix in e10aa04 that updates the channel state to TRANSIENT_FAILURE if steam creation fails. The error has been ignored since the code was first introduced in #2389 and there are no comments indicating why this is required. This may have been an oversight. All the tests pass with the fix.
As an optimization, we could use a boolean to store if we've already updated the subchannel state to TF and avoid causing multiple picker updates after each attempt.
@atollena will you be open to raising a PR with the proposed change?
Yes, I'll take care of it.
@atollena : Gentle ping on whether you are going to raise a PR for this? Thanks.
@atollena : Gentle ping on whether you are going to raise a PR for this? Thanks.
I don't have the bandwidth to work on this right now, sorry about it. I'd suggest we simply close this issue for now, this is a quirk that users can live with.
Thanks @atollena