ol-infrastructure
Uptime monitoring of internal services
User Story
As a platform engineer, I want to be notified when an internal service is unavailable so that it can be addressed before it impacts external users.
Description/Context
As we implement more services (particularly for Open edX) that are deployed independently and accessed only from within our networks, we need a maintainable, scalable, and robust method to determine service uptime and to alert when a service starts to fail.
Acceptance Criteria
- [ ] We are able to alert when internal services (e.g. xqueue, forum) are not operating properly
- [ ] We are able to use a simple and robust method to determine the availability of services without requiring bespoke, intimate knowledge of the various failure modes of a given service
- [ ] We are able to apply the uptime monitoring to new services easily and reliably
- [ ] We are able to reduce false positive alerts, particularly in off hours
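
One possible way to meet the criteria above is a generic probe per service combined with a consecutive-failure threshold, so that a single transient failure does not page anyone. The following is only a minimal sketch under stated assumptions: the names `ServiceState`, `record_probe`, and `run_checks`, the default threshold of 3, and the idea that each service exposes a probe returning `True` when healthy are all hypothetical and not part of any existing tooling in this repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServiceState:
    """Tracks consecutive failed probes for one service (hypothetical helper)."""
    consecutive_failures: int = 0
    alerting: bool = False


def record_probe(state: ServiceState, healthy: bool, threshold: int = 3) -> bool:
    """Record one probe result; return True only when a new alert should fire."""
    if healthy:
        # Any success resets the counter and clears the alerting flag.
        state.consecutive_failures = 0
        state.alerting = False
        return False
    state.consecutive_failures += 1
    # Fire once when the threshold is reached, then stay quiet until recovery.
    if state.consecutive_failures >= threshold and not state.alerting:
        state.alerting = True
        return True
    return False


def run_checks(
    probes: Dict[str, Callable[[], bool]],
    states: Dict[str, ServiceState],
    threshold: int = 3,
) -> List[str]:
    """Run every probe once and return the names of services that newly alert."""
    alerts = []
    for name, probe in probes.items():
        state = states.setdefault(name, ServiceState())
        try:
            healthy = probe()
        except Exception:
            # A probe that raises is treated as a failed check.
            healthy = False
        if record_probe(state, healthy, threshold):
            alerts.append(name)
    return alerts
```

In this sketch, adding a new service means adding one entry to `probes`; the probe itself could be as simple as an HTTP request to a status endpoint, which is an assumption about how the services expose health rather than something stated in this issue.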