nr1-slo-r

Automatically create alerts based on error budget burn.

Open ghost opened this issue 4 years ago • 0 comments

Summary

Alerts have traditionally focused on system behaviors that are symptomatic of user impact (e.g. high CPU, higher-than-normal throughput). With the adoption of SLOs, we can now measure user experience directly and alert directly on user pain (e.g. high error rates, slow transactions).

To make adoption of SLO-driven alerting easier, I would like a simple GUI that automatically generates alert conditions and adds them to a named policy, so that application teams do not have to create the alerts manually.

Desired Behaviour

When creating or editing an SLO, users should be presented with an optional 'Configure Alerting' component. This component should allow users to configure alerts like this or by choosing pre-defined alerts from a dropdown. The user should be able to name an existing or new alert policy that they have access to; the condition(s) are added to that policy when the SLO is saved.

Likewise, if a user modifies or removes alerts on an SLO, SLO/R should update or remove the corresponding conditions in the alert policy.
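To make the sync behaviour concrete, here is a minimal sketch of how the conditions could be reconciled when an SLO is saved. It is illustrative only: `plan_condition_changes` and the dictionary shapes are hypothetical and not part of SLO/R or any New Relic API, and it assumes `existing` holds only the conditions SLO/R previously created for this SLO, so conditions that other people added to the policy are never touched.

```python
def plan_condition_changes(desired, existing):
    """Compare the alerts configured on an SLO with the conditions already
    in the policy and return what has to change.

    Both arguments are hypothetical dicts mapping a condition name to its
    settings. `existing` is assumed to contain only conditions that were
    created for this SLO.
    """
    # Alerts configured on the SLO that have no condition yet.
    to_create = {name: cfg for name, cfg in desired.items()
                 if name not in existing}
    # Conditions that exist but whose settings no longer match.
    to_update = {name: cfg for name, cfg in desired.items()
                 if name in existing and existing[name] != cfg}
    # Conditions whose alert was removed from the SLO.
    to_delete = sorted(name for name in existing if name not in desired)
    return to_create, to_update, to_delete
```

The caller would then apply the three result sets to the chosen policy through whatever alerting API is available.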

Possible Solution

Discussion of alerting windows and urgencies:

  1. https://mads-hartmann.com/sre/2020/09/08/alerting-on-slos.html
  2. https://landing.google.com/sre/workbook/chapters/alerting-on-slos/
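As a starting point for the pre-defined alerts, the burn-rate arithmetic described in the Google SRE Workbook chapter linked above can be sketched as follows. The class and function names are hypothetical. The default windows and budget fractions are the ones the workbook recommends for a 30-day SLO period (2% of the budget in 1 hour and 5% in 6 hours as pages, 10% in 3 days as a ticket); the right defaults for SLO/R are still open for discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateAlert:
    window_hours: float     # look-back window the alert evaluates
    budget_fraction: float  # share of the error budget consumed in that window
    urgency: str            # "page" or "ticket"


# Defaults recommended in the SRE Workbook for a 30-day SLO period.
DEFAULT_ALERTS = [
    BurnRateAlert(window_hours=1, budget_fraction=0.02, urgency="page"),
    BurnRateAlert(window_hours=6, budget_fraction=0.05, urgency="page"),
    BurnRateAlert(window_hours=72, budget_fraction=0.10, urgency="ticket"),
]


def burn_rate(alert, slo_period_hours):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return alert.budget_fraction * slo_period_hours / alert.window_hours


def error_rate_threshold(alert, slo_target, slo_period_hours):
    """Error rate over the alert window at which the alert should fire."""
    return burn_rate(alert, slo_period_hours) * (1.0 - slo_target)


if __name__ == "__main__":
    period_hours = 30 * 24  # 30-day SLO period
    for alert in DEFAULT_ALERTS:
        rate = burn_rate(alert, period_hours)
        threshold = error_rate_threshold(alert, 0.999, period_hours)
        print(f"{alert.window_hours}h {alert.urgency}: "
              f"burn rate {rate:.1f}, error rate > {threshold:.4%}")
```

For a 99.9% SLO over 30 days, the 1-hour alert corresponds to a burn rate of about 14.4, i.e. an error rate of roughly 1.44% over that hour. Each threshold could then be turned into one alert condition on the chosen policy.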

I am open to discussion on the best UX for this.

Additional context

We are planning to roll out SLOs to all critical applications by Q1 2021. A major component of this rollout is a gradual migration away from noisy system-based alerts to SLO-based alerts that only fire when users are experiencing impact. This will help save significant time, cost, and toil for our application teams, and make it more obvious where in a given dependency chain an impact is coming from.

Manually creating these alerts (never mind maintaining them) is cost-prohibitive and would require SRE involvement to train every application team. This enhancement would allow us to set sane defaults and let application teams self-serve the implementation piece of this rollout.

@khpeet

ghost • Oct 12 '20 16:10