
RFC: health endpoint

Open alexbakar opened this issue 4 years ago • 4 comments

Adding a proposal for basic health endpoint for Concourse cluster.

alexbakar avatar Mar 07 '20 13:03 alexbakar

@alexbakar I think it might help you to elaborate on why this is beneficial beyond what /api/v1/info already provides (which is what's currently used in the Helm chart liveness/readiness probes).

ari-becker avatar Mar 07 '20 17:03 ari-becker

Thank you guys for the feedback, it's very useful.

@ari-becker AFAIK /api/v1/info only reports the Concourse version, the workers' version, and the external URL. My idea was to also have a way to get the overall status of the nodes. And since the endpoint is meant to be public (no authentication required), this information could only be very basic: db/web/workers and their status.

@cirocosta Thanks, the explanation really makes sense. I agree that the proposal I made is probably already covered by /api/v1/workers (as it exposes plenty of information about the workers, including their state) and conflicts with the auth requirements for it. I also like the idea of verifying healthiness by sending specific workloads that would attest to it in an end-to-end fashion (slirunner). I have to think about these topics and reconsider my proposal, and will post my comments then.

alexbakar avatar Mar 11 '20 07:03 alexbakar

@alexbakar we (the Concourse team) are focusing on paying more attention to RFCs and shepherding them to some form of resolution. I was wondering if you've had the chance to put any more thought into this/whether you're still interested in this topic?

aoldershaw avatar Apr 19 '21 14:04 aoldershaw

Hi there,

I'd like to drop my thoughts here on this topic... :smile: I see two kinds of monitoring for two kinds of installations.

  1. A full-on, 8000 user, 800 pipeline, pedal to the metal, HA, 99.9999% uptime critical Concourse installation. (exaggeration on purpose 😉 )
  2. A simple, small, quick-n-dirty, 3 user, 5 pipeline, 98.0% uptime Concourse install.

For the first option, of course you'd want a full-on monitoring system like Prometheus, with alerting, that executes specific workloads to test the various components and retrieves metrics.

However, for the second (and other) case(s), such a monitoring system would be overkill. Maybe you'd want to use a simple "ping" or "HTTP content" check alerting tool instead (think something like https://github.com/iloire/WatchMen).

For such an alerting tool, an equally simple health check would be greatly appreciated.

What I'd suggest:

  * A clear endpoint for health/status checks: /api/v1/health or /api/v1/status
  * A clear HTTP response code: 200 if everything is OK, 503 if at least one item is not OK, 500 if nothing is OK
  * A clear and simple JSON response, like:

{
  "atc": "ok",
  "db": "ok",
  "workers": "ok",
  "timestamp": 1619710516
}

If something were amiss with the workers for example, an admin could then zoom in by checking /api/v1/workers.
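To make the status-code mapping concrete, here's a rough sketch in Go (Concourse's language). `Health`, `statusCode`, and `healthHandler` are made-up names, and the hard-coded "ok" values stand in for real component probes; this is just an illustration of the proposal, not actual Concourse code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Health mirrors the JSON shape proposed above.
type Health struct {
	ATC       string `json:"atc"`
	DB        string `json:"db"`
	Workers   string `json:"workers"`
	Timestamp int64  `json:"timestamp"`
}

// statusCode maps component states to the proposed HTTP codes:
// 200 when everything is "ok", 500 when nothing is, 503 otherwise.
func statusCode(h Health) int {
	ok := 0
	for _, s := range []string{h.ATC, h.DB, h.Workers} {
		if s == "ok" {
			ok++
		}
	}
	switch ok {
	case 3:
		return http.StatusOK
	case 0:
		return http.StatusInternalServerError
	default:
		return http.StatusServiceUnavailable
	}
}

// healthHandler is what GET /api/v1/health could look like; the
// hard-coded statuses stand in for real db/web/worker probes.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	h := Health{ATC: "ok", DB: "ok", Workers: "ok", Timestamp: time.Now().Unix()}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(statusCode(h))
	json.NewEncoder(w).Encode(h)
}

func main() {
	// Exercise the mapping without starting a server.
	fmt.Println(statusCode(Health{ATC: "ok", DB: "ok", Workers: "ok"}))          // 200
	fmt.Println(statusCode(Health{ATC: "ok", DB: "error", Workers: "ok"}))       // 503
	fmt.Println(statusCode(Health{ATC: "error", DB: "error", Workers: "error"})) // 500
}
```

A simple alerting tool would then only need to check the status code, while a human could read the body.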

Just some thoughts...

mvdkleijn avatar Apr 29 '21 15:04 mvdkleijn