Add SOPs/runbooks for all Prometheus critical alerts

Open obrienrobert opened this issue 5 years ago • 2 comments

Enabling monitoring in 2.9 adds a number of new Prometheus alerts, many of which are of critical severity. Before we (the RHMI team) can enable monitoring, we have to ensure that these critical alerts meet the acceptance criteria outlined by CS-SRE. One of these criteria states that every critical alert must have an accompanying SOP/runbook, linked via an annotation in the alert, that includes detailed steps on how to investigate and resolve the alert.

We have many examples of SOPs that were created for the critical alerts within RHMI. If needed, these could be shared to show the kind of detail each one contains and how they are structured.
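
For illustration, the "every critical alert links to an SOP" criterion could be checked with something like the sketch below. This is only a rough idea, not an existing tool: the annotation key `sop_url`, the alert names, and the URL are all assumptions made up for the example, and the input is assumed to be the `groups` list of a Prometheus rule file already loaded into Python dicts.

```python
from typing import Any

# Assumed annotation key; this issue does not name the annotation CS-SRE expects.
SOP_ANNOTATION = "sop_url"


def critical_alerts_missing_sop(
    groups: list[dict[str, Any]], annotation: str = SOP_ANNOTATION
) -> list[str]:
    """Return the names of critical alerting rules without a non-empty SOP annotation."""
    missing = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules carry no severity or SOP link
            severity = rule.get("labels", {}).get("severity")
            link = rule.get("annotations", {}).get(annotation, "")
            if severity == "critical" and not link.strip():
                missing.append(rule["alert"])
    return missing


# Hypothetical rule groups, shaped like the `groups` list of a Prometheus rule file.
example_groups = [
    {
        "name": "example.rules",
        "rules": [
            {
                "alert": "ExampleBackendDown",  # critical, no SOP link
                "expr": "up == 0",
                "labels": {"severity": "critical"},
                "annotations": {"summary": "Backend is down"},
            },
            {
                "alert": "ExampleHighLatency",  # critical, SOP link present
                "expr": "example_latency_seconds > 1",
                "labels": {"severity": "critical"},
                "annotations": {"sop_url": "https://example.com/sops/high-latency.md"},
            },
            {
                "alert": "ExampleDiskFilling",  # warning, so not covered by the criterion
                "expr": "example_disk_free_ratio < 0.2",
                "labels": {"severity": "warning"},
                "annotations": {},
            },
        ],
    }
]

print(critical_alerts_missing_sop(example_groups))
```

A check like this could run in CI against the generated rules, so a new critical alert without an SOP link is caught before monitoring is enabled.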

obrienrobert avatar Sep 07 '20 21:09 obrienrobert

@miguelsorianod

obrienrobert avatar Sep 07 '20 21:09 obrienrobert

@obrienrobert Let's discuss this in our API Management Service - PM/ENG Sync call this Wednesday, and then create a Jira.

thomasmaas avatar Sep 08 '20 07:09 thomasmaas

Closing this as RHMI has its own SOPs for SRE.

austincunningham avatar Nov 23 '23 14:11 austincunningham