slogen
Burnrate alerts aren't working correctly
I have an SLO that is 30m (short window) and 6h (long window). I've put the threshold the same on both.
When the alert triggered, it fired quickly (within 5m), but it took 6 hours to resolve after the service went back to normal.
I would have expected it to be resolved quickly according to https://sre.google/workbook/alerting-on-slos/
Looking into this a bit deeper, I think that the threshold values on the monitor take 6 hours to evaluate, and it might not be possible to do "Multiwindow, Multi-Burn-Rate Alerts" using Sumo Logic's monitors.
Also, I have 3 alerts associated with an SLO: 10m-1h, 30m-6h, and 6h-24h. In Prometheus, the alerts aren't duplicated because they're grouped together (as you can see in the query in the article), but in Sumo I got 3 emails per SLO while the system was down.
Looking into this a bit more thoroughly, it looks like the monitor is being evaluated over the long window, and if combined_burn exceeds 1 at any time in that window, the alert won't resolve. This would mean that combined_burn has to stay at 1 or lower for the entire long window before the alert clears.
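The resolution delay described above can be reproduced with a small simulation. This is a rough sketch, not Sumo Logic's actual evaluation logic: it assumes one combined_burn sample per minute and that the alert stays open while any sample in the trailing evaluation window exceeds the threshold. The function names are hypothetical.

```python
from typing import List


def alert_active(combined_burn: List[int], t: int, window: int, threshold: int = 1) -> bool:
    """True if any sample in the trailing `window` minutes ending at t exceeds threshold."""
    start = max(0, t - window + 1)
    return any(v > threshold for v in combined_burn[start:t + 1])


def minutes_until_resolved(combined_burn: List[int], last_breach: int, window: int) -> int:
    """Minutes after the last breaching sample until the alert is no longer active."""
    t = last_breach
    while t < len(combined_burn) and alert_active(combined_burn, t, window):
        t += 1
    return t - last_breach


# Assumed scenario: 5 minutes of outage (combined_burn == 2), then healthy.
series = [2] * 5 + [0] * 400

long_window_delay = minutes_until_resolved(series, last_breach=4, window=360)
short_window_delay = minutes_until_resolved(series, last_breach=4, window=30)
print(long_window_delay, short_window_delay)
```

Under these assumptions, evaluating over the 6h window keeps the alert open for 360 minutes after the last breach, while evaluating over the 30m window clears it after 30 minutes.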
I think we might have to change the monitor to be evaluated over the short period of time, but move the calculations for the combined_burn into a scheduled search so that it can be evaluated over a period of time.
It looks like a scheduled search wouldn't do it, but a scheduled view would. You can pre-populate the scheduled view with the current longBurnRate, and then calculate the latestBurnRate in the monitor.
Also, I've noticed that I am using the triggers for "Warning" and "ResolvedWarning", which trip when combined_burn exceeds 1. The "Critical" and "ResolvedCritical" triggers seem to trip when combined_burn exceeds 2, but this will never happen, as it can at most equal 2:
if (longBurnRate > 6 , 1,0) as long_burn_exceeded
| if ( latestBurnRate > 6, 1,0) as short_burn_exceeded
| long_burn_exceeded + short_burn_exceeded as combined_burn
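To make the point about the thresholds concrete, here is a minimal Python mirror of the query above. The function name and the sample burn rates are made up for illustration; only the limit of 6 and the two `if` expressions come from the query.

```python
def combined_burn(long_burn_rate: float, latest_burn_rate: float, limit: float = 6.0) -> int:
    """Mirror of the query: each window contributes 1 when its burn rate exceeds the limit."""
    long_burn_exceeded = 1 if long_burn_rate > limit else 0
    short_burn_exceeded = 1 if latest_burn_rate > limit else 0
    return long_burn_exceeded + short_burn_exceeded


# Illustrative burn rates covering every combination of the two windows.
samples = [(1.0, 1.0), (10.0, 1.0), (1.0, 10.0), (10.0, 10.0)]
values = [combined_burn(long_rate, short_rate) for long_rate, short_rate in samples]

warning_trips = [v > 1 for v in values]   # only when both windows exceed the limit
critical_trips = [v > 2 for v in values]  # never, since the maximum value is 2
print(values, warning_trips, critical_trips)
```

Because combined_burn only takes the values 0, 1, or 2, a trigger on "exceeds 2" can never fire; a Critical trigger would need a different condition, for example a higher burn-rate limit.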
Also, looking into https://sre.google/workbook/alerting-on-slos/ more, it seems that they combine alerts based on the notification type.
For example:
expr: (
        job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
      and
        job:slo_errors_per_request:ratio_rate5m{job="myjob"} > (14.4*0.001)
      )
    or
      (
        job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
      and
        job:slo_errors_per_request:ratio_rate30m{job="myjob"} > (6*0.001)
      )
severity: page
This query combines both burn-rate conditions into a single alert. If either condition is met, the same notification is sent, so there is only one notification for the incident and no duplicated alerts.
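The same grouping logic can be sketched outside of PromQL. This is only an illustration: `should_page` and its parameter names are hypothetical, and 0.001 is taken from the query above as the error budget (which would correspond to a 99.9% SLO).

```python
def should_page(rate_1h: float, rate_5m: float, rate_6h: float, rate_30m: float,
                error_budget: float = 0.001) -> bool:
    """One page-severity decision covering both long/short window pairs."""
    # Fast burn: 1h long window paired with a 5m short window, burn rate 14.4.
    fast_burn = rate_1h > 14.4 * error_budget and rate_5m > 14.4 * error_budget
    # Slow burn: 6h long window paired with a 30m short window, burn rate 6.
    slow_burn = rate_6h > 6 * error_budget and rate_30m > 6 * error_budget
    # A single boolean means a single notification, whichever pair fired.
    return fast_burn or slow_burn


# Made-up error ratios for illustration.
print(should_page(0.02, 0.02, 0.0, 0.0))        # fast pair fires
print(should_page(0.02, 0.0, 0.0, 0.0))         # long window alone is not enough
print(should_page(0.007, 0.0, 0.007, 0.007))    # slow pair fires
```

Since each pair requires both its long and short window to exceed the threshold, the alert stops firing as soon as the short window recovers, which is the quick-resolve behaviour expected from the workbook.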
I think it might be worthwhile updating the SLO configuration to the latest OpenSLO Spec. They have added a few objects such as "AlertPolicies" which have 1 or more "Alert Conditions". This would allow the configuration to group all of the "long/short burn rate" conditions into 1 alert.
Ah damn, it looks like OpenSLO oslo doesn't support the latest OpenSLO Spec.
https://github.com/OpenSLO/oslo/issues/63
Hey @lswith, I will discuss the monitor not resolving with the monitors team and get back to you by tomorrow. I recall it was to prevent frequent flapping between alert opening and closing, but waiting for 6h defeats the purpose of a multi-window monitor.
The update to oslo is currently blocked for two reasons: 1) they haven't updated oslo and 2) it doesn't support multi burn rate monitors yet. I will discuss this with the OpenSLO team and try to expedite it by raising a PR for oslo.
The monitor team is working on adding a configurable resolution window for monitors. After that, setting the resolve window to the short-burn period will give us the correct behaviour required for these alerts. The ETA for this feature is end of March.
cc: @tarunk2