
`/readyz` endpoint returns 200 OK when not all enabled services are running

Open programmerq opened this issue 8 months ago • 1 comments

What would you like Teleport to do?

Introduce a new health-check endpoint (or modify the existing /readyz endpoint) that provides a 200 OK response only if all enabled services in the configuration are up and running without errors.

What problem does this solve?

Currently, the /readyz endpoint returns a 200 OK status as soon as the instance successfully heartbeats with the cluster.

This means that if one or more of the configured Teleport services (e.g., app_service) is not yet ready, or never starts up properly, /readyz still returns a 200 OK, as long as the instance was able to send a heartbeat of any kind.

A repeatable way to get a successful heartbeat alongside a broken service is to enable both the ssh_service and the app_service, and then join the cluster with a token that is valid for the app role only. The app_service starts up and the instance heartbeats, but the ssh_service never becomes healthy, all while /readyz returns 200 OK.

If a workaround exists, please include it.

I looked over the /metrics endpoint, hoping that health/status info for each service might be there, but it wasn't. There doesn't appear to be a good way to determine the readiness based on the status of the individual Teleport services.

/healthz always returns a 200 as long as the process is running, so it does not help here. If it is decided that the current behavior of /readyz should not be altered, an additional endpoint with the desired behavior would be great.

programmerq, Jun 24 '24 21:06