RTX
RTX copied to clipboard
Set up a /healthcheck endpoint that can be monitored
Based on today's AHM discussion: It would nice to have something like a /healthcheck endpoint that could report substantial problems and could be monitored.
So for example,
- When /healthcheck is called, it could make sure that the KP info cache is less than 30 minutes old
- There aren't a bunch of stale active processes
- Relay any major errors*
It can start small, but ideally be flexible so that we could add more health checks, too.
Footnote* I have often mused about somehow have a response.error() option that is something like tell_a_human=True or something, where not only did the processing end in error, but this condition really ought to be relayed to an administrator rather than buried in a log file that no one is likely to read.
This would be nice for the ITRB endpoints in particular, where we can't even log in to poke around.