RFC: health endpoint
Adding a proposal for basic health endpoint for Concourse cluster.
@alexbakar I think it might help you to elaborate on why this is beneficial beyond what /api/v1/info
already provides (which is what's currently used in the Helm chart liveness/readiness probes).
Thank you guys for the feedback, it's very useful.

@ari-becker AFAIK /api/v1/info provides information about the Concourse version, the workers' version, and the external URL. My idea was to also have a way to get the overall status of the nodes. And since the endpoint is meant to be public (no authentication required), this information could only be very basic: db/web/workers and their status.

@cirocosta Thanks, the explanation really makes sense. I agree that the proposal I made is probably already covered by /api/v1/workers (as there is plenty of information about the workers, including their state) and that it conflicts with the auth requirements for that endpoint. I also like the idea of verifying healthiness by sending specific workloads that would attest to it in an end-to-end fashion (slirunner).

I have to think about these topics and reconsider my proposal. I will send my comments then.
@alexbakar we (the Concourse team) are focusing on paying more attention to RFCs and shepherding them to some form of resolution. I was wondering if you've had the chance to put any more thought into this/whether you're still interested in this topic?
Hi there,
I'd like to drop my thoughts here on this topic... :smile: I see two kinds of monitoring for two kinds of installations.
- A full-on, 8000 user, 800 pipeline, pedal to the metal, HA, 99.9999% uptime critical Concourse installation. (exaggeration on purpose 😉 )
- A simple, small, quick-n-dirty, 3 user, 5 pipeline, 98.0% uptime Concourse install.
For the first option, of course you'd want a full-on monitoring system like Prometheus, with alerts etc., that executes specific workloads to test the various components and retrieves metrics.
However, for the second (and other) case(s), such a monitoring system would be overkill. Maybe you'd want to use a simple "ping" or "HTTP content" check alerting tool (think something like https://github.com/iloire/WatchMen).
For such an alerting tool, an equally simple health check would be greatly appreciated.
A clear endpoint for health/status checks: /api/v1/health or /api/v1/status
A clear HTTP response code: 200 when everything is OK, 503 if at least one item is not OK, 500 if none are OK.
A clear and simple JSON response like:

```json
{
  "atc": "ok",
  "db": "ok",
  "workers": "ok",
  "timestamp": 1619710516
}
```
If something were amiss with the workers, for example, an admin could then zoom in by checking /api/v1/workers.
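To make the proposed 200/503/500 semantics concrete, here is a minimal sketch of the status-code logic such an endpoint could use. This is purely illustrative: the endpoint, the `health_status_code` helper, and the component names are assumptions from this proposal, not anything Concourse actually implements.

```python
import json


def health_status_code(components: dict) -> int:
    """Map component statuses to the HTTP code proposed above:
    200 if every component is "ok", 503 if at least one component
    is not ok, 500 if none are ok. Non-component fields such as
    "timestamp" are ignored."""
    ok = [v == "ok" for k, v in components.items() if k != "timestamp"]
    if all(ok):
        return 200
    if any(ok):
        return 503
    return 500


# Example: the JSON body from the proposal above.
body = json.loads(
    '{"atc": "ok", "db": "ok", "workers": "ok", "timestamp": 1619710516}'
)
print(health_status_code(body))  # 200
```

A simple alerting tool would then only need to check the status code, while the JSON body tells an admin which component to investigate further.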
Just some thoughts...