
Fix our grpc_health_v1 implementation

Open beautifulentropy opened this issue 9 months ago • 3 comments

In 2020 we imported grpc-go's health_v1 implementation, and began checking this status in startservers.py: https://github.com/letsencrypt/boulder/pull/5074

In 2023 we added support for deep checks to all boulder services and actually implemented one in the SA: https://github.com/letsencrypt/boulder/pull/6928

It appears that we missed one aspect of fully enabling health checking in grpc-go (link):

```go
// startHealthCheck starts the health checking stream (RPC) to watch the health
// stats of this connection if health checking is requested and configured.
//
// LB channel health checking is enabled when all requirements below are met:
// 1. it is not disabled by the user with the WithDisableHealthCheck DialOption
// 2. internal.HealthCheckFunc is set by importing the grpc/health package
// 3. a service config with non-empty healthCheckConfig field is provided
// 4. the load balancer requests it
//
// It sets addrConn to READY if the health checking stream is not started.
//
// Caller must hold ac.mu.
```

But we do not set a healthCheckConfig in our client setup.

A grpc-go server will promote itself to SERVING automatically. But without a healthCheckConfig, the client never begins checking the health status of server endpoints, and thus never uses that status when deciding whether a given endpoint should receive traffic.
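As a rough sketch of what satisfying all four requirements above looks like on the client side (the target address and exact service config here are hypothetical, not Boulder's actual setup): importing grpc/health registers the health-check function, and a service config must supply both a non-empty healthCheckConfig and a balancer that requests health checking (round_robin does; the default pick_first does not).

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Blank import registers internal.HealthCheckFunc (requirement 2).
	_ "google.golang.org/grpc/health"
)

// serviceConfig covers requirements 3 and 4: a non-empty healthCheckConfig,
// plus round_robin, which requests health checking from the channel.
const serviceConfig = `{
	"loadBalancingConfig": [{"round_robin":{}}],
	"healthCheckConfig": {"serviceName": ""}
}`

func main() {
	conn, err := grpc.Dial(
		"dns:///sa.example.internal:9095", // hypothetical target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(serviceConfig),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```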

The load balancer uses the serviceName from the healthCheckConfig field in the client's service config.

So if you set:

```json
"healthCheckConfig": {"serviceName": ""}
```

it will query the overall health ("").

If you set:

```json
"healthCheckConfig": {"serviceName": "a.service.Name"}
```

it will query the per-service health for "a.service.Name".

Once we set a healthCheckConfig, the load balancer routes traffic only to endpoints reporting SERVING for that specific service. If no status has been set for the requested service, the server returns NOT_FOUND and the client marks that endpoint as TRANSIENT_FAILURE.
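On the server side, the status a client observes is whatever was set for the exact service name it asked about. A minimal sketch using grpc-go's health package (registration boilerplate assumed; the "a.service.Name" name is just the example from above):

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	s := grpc.NewServer()
	hs := health.NewServer()
	healthpb.RegisterHealthServer(s, hs)

	// Overall health: matched by a client whose healthCheckConfig.serviceName is "".
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)

	// Per-service health: matched only by the exact name. A client configured
	// with "a.service.Name" sees NOT_FOUND (and goes TRANSIENT_FAILURE) until
	// a status is set for that name.
	hs.SetServingStatus("a.service.Name", healthpb.HealthCheckResponse_SERVING)
}
```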

For most of our services, setting and checking the overall health ("") will be fine. But when we wrote grpc.initLongRunningCheck() we passed a service name in, and began setting health for that specific service instead of the overall health. We should consider removing that argument and instead setting the overall health using the deep health checks. If we make this change, SRE will need some time to update the Consul health checks for the SA, which rely on the service-specific check.

beautifulentropy avatar Jun 03 '25 23:06 beautifulentropy

For historical reference, I don't think this is a bug: we added health-checking experimentally, as something we could use in startservers that's better than pinging the debug port. We knew that we weren't actually using it between our gRPC services in prod, and that was intentional.

That said, I think this is a great idea and it's appropriate to turn this on now.

aarongable avatar Jun 04 '25 00:06 aarongable

> For historical reference, I don't think this is a bug: we added health-checking experimentally, as something we could use in startservers that's better than pinging the debug port. We knew that we weren't actually using it between our gRPC services in prod, and that was intentional.
>
> That said, I think this is a great idea and it's appropriate to turn this on now.

In https://github.com/letsencrypt/boulder/pull/5093 the description says:

> This change also imports the health client into our grpc client, ensuring that all of our grpc clients use the health service to inform their load-balancing behavior.

In @jsha's review of this change, he says:

> I believe this actually changes the behavior of gRPC's client-side LB (which we use), such that it relies on the health check service rather than simple connectivity to judge whether a backend is up: https://pkg.go.dev/google.golang.org/grpc/internal.

Can you clarify what was meant by these comments or whether something has changed since then?

beautifulentropy avatar Jun 04 '25 00:06 beautifulentropy

Looks like my recollection was wrong! Thanks for the correction.

aarongable avatar Jun 04 '25 01:06 aarongable