ol-infrastructure

Uptime monitoring of internal services

Open blarghmatey opened this issue 2 years ago • 0 comments

User Story

As a platform engineer, I want to be notified when an internal service is unavailable so that it can be addressed before it impacts external users.

Description/Context

As we implement more services (particularly for Open edX) that are independently deployed and accessed only from within our networks, we need a maintainable, scalable, and robust method to determine service uptime and to alert when a service starts to fail.

Acceptance Criteria

  • [ ] We are able to alert when services such as xqueue or forum are not operating properly
  • [ ] We are able to use a simple and robust method to determine the availability of services without requiring bespoke, intimate knowledge of the various failure modes of a given service
  • [ ] We are able to apply the uptime monitoring to new services easily and reliably
  • [ ] We are able to reduce false positive alerts, particularly in off hours
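
One possible shape for these criteria, offered as a minimal sketch and not as a chosen design: probe each service with a generic HTTP request, and alert only after several consecutive failures so that a single blip does not page anyone. Everything named here is an assumption for illustration: `ServiceCheck`, `http_probe`, `record_result`, `run_checks`, the default threshold of 3, and the idea that each service exposes some URL that returns a 2xx status when healthy.

```python
import urllib.request
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ServiceCheck:
    """One monitored service (hypothetical structure, not an existing API)."""
    name: str
    url: str                       # assumed health URL reachable from inside the network
    failure_threshold: int = 3     # assumed default; tune per service
    consecutive_failures: int = 0  # current streak of failed probes


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False


def record_result(check: ServiceCheck, healthy: bool) -> bool:
    """Update the failure streak and return True when an alert should fire."""
    if healthy:
        check.consecutive_failures = 0
        return False
    check.consecutive_failures += 1
    # Fire once, exactly when the streak reaches the threshold, so a
    # prolonged outage does not produce an alert on every probe.
    return check.consecutive_failures == check.failure_threshold


def run_checks(
    checks: List[ServiceCheck],
    probe: Callable[[str], bool] = http_probe,
) -> List[str]:
    """Probe every service once and return the names that need an alert."""
    return [c.name for c in checks if record_result(c, probe(c.url))]
```

Adding a new service would then mean appending one `ServiceCheck` entry, and the probe needs no knowledge of a service's specific failure modes. A scheduler would call `run_checks` at a fixed interval and forward the returned names to whatever alerting channel is in use; both of those pieces are outside this sketch.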

blarghmatey, Feb 14 '23 19:02