Feature Request: Send resolved immediately if all alerts in a group are not firing anymore
Given that send_resolved is active on a given alert route: if all alerts in a group are resolved, the "final resolved notification" is sent the next time group_wait expires. It would be nice if, in this case, group_wait were skipped and the "final resolved notification" were sent immediately.
The motivation is to get faster feedback that the countermeasures taken to address the issue causing the alerts to fire have been successful.
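For reference, a minimal sketch of where send_resolved and the grouping timers live; the receiver name, endpoint, timings and grouping labels are placeholders:

```yaml
route:
  receiver: 'oncall'
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m

receivers:
  - name: 'oncall'
    webhook_configs:
      - url: 'https://example.com/notify'  # placeholder endpoint
        send_resolved: true                # also notify when the grouped alerts resolve
```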
This is in direct opposition to the purpose of group_interval, which is to throttle the rate of notifications from a group. This would be like asking for an instant notification when there was a new alert. You can always reduce group_interval if you want to know faster.
Also, that the alerts have stopped firing does not mean that the issue has resolved.
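For example, a child route with a smaller group_interval gets follow-up notifications, including resolved ones, out faster for the alerts it matches; the label matcher and values below are placeholders:

```yaml
route:
  receiver: 'oncall'
  group_wait: 30s
  group_interval: 5m
  routes:
    # faster follow-up (and resolved) notifications for the alerts this
    # child route matches; it inherits the receiver from the parent route
    - match:
        team: frontend
      group_interval: 1m
```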
The number of notifications should not change at all, unless in some corner case your services in the same alert group are flapping in such a way that they would all go to non-alerting and back to alerting before group_wait expires. If you have such a situation, you have done something severely wrong anyway. Basically, with this proposed feature you might shorten the time until notifications for a given alert group are sent, but not their number, so where is the harm? Comparing this to sending the first alert immediately is not correct IMHO.
Basically, with this proposed feature you might shorten the time until notifications for a given alert group are sent, but not their number,
That does affect their number, as the time between them would be reduced if the resolution is just a flap, for example.
so where is the harm?
Spamming the oncall is to be avoided.
Comparing this to sending the first alert immediately is not correct IMHO.
I think you're confusing group_wait and group_interval. The relevant setting here is group_interval.
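Roughly, the two settings control different delays; the values below are only illustrative:

```yaml
route:
  receiver: 'oncall'
  group_wait: 30s      # delay before the *first* notification for a new group
  group_interval: 5m   # minimum gap between *subsequent* notifications for that
                       # group, which is the delay the resolved notification hits
```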
Like I laid out, flapping that resolves all alerts in a group should be considered a major flaw in your setup, so in my mind raising the number of notifications is not a thing. Also this raises the question about a flap detection feature :smile:
Like I laid out, flapping that resolves all alerts in a group should be considered a major flaw in your setup, so in my mind raising the number of notifications is not a thing.
This could be caused by a network problem, or a Prometheus restart. It's not an unexpected occurrence.
Also this raises the question about a flap detection feature 
You're proposing to remove part of the one we have.
Yes, I meant group_interval, not group_wait. Yes, a Prometheus restart could trigger this; I did not think about that case. Shouldn't that be mitigated in part by https://github.com/prometheus/prometheus/pull/4061?
About the flap detection, isn't the current way of doing things more of a flap-hiding? Potentially there could be flapping alerts and the user sees a continuous problem instead of being notified about the flapping going on.
Shouldn't that be mitigated in part by
Yes, one of the main ways it could happen should happen a lot less often.
About the flap detection, isn't the current way of doing things more of a flap-hiding?
Yes, it's more like hysteresis. By the time the alert is firing we're past the point of considering flaps (alert thresholds are meant to be set so hitting them once is severe enough to cause a notification).
Potentially there could be flapping alerts and the user sees a continuous problem instead of being notified about the flapping going on.
We don't care about flaps per se. We care about there being a bad state that the oncall needs to know about. We also care about not spamming that oncall, and all the group_wait/group_interval logic we have here is there to increase the signal-to-noise ratio.
If you want fast feedback that your mitigations/fixes are taking effect, you will best get that from dashboards, not the Alertmanager, which will potentially take minutes to hours to pick that up depending on the nature of the system and alert.
I also opened a discussion around the same topic on the users Google group. Basically, what we are trying to achieve is to drop all the update notifications and simply forward the first firing event (after group_wait) and the "fully resolved" event.
This looks impossible right now. On top of send_resolved, I would like to have a send_updates boolean (see the sketch below).
Also, setting a very long group_interval (5y for instance) suppresses every new alert of the same group, even if the group was fully resolved in between. To me this is not expected...
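To make this concrete, something like the following is what I have in mind; send_updates is purely hypothetical and does not exist in Alertmanager today:

```yaml
receivers:
  - name: 'oncall'
    webhook_configs:
      - url: 'https://example.com/notify'  # placeholder endpoint
        send_resolved: true
        # hypothetical flag: when false, only the first firing notification
        # (after group_wait) and the final all-resolved notification would be
        # sent, with no intermediate group updates in between
        send_updates: false
```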
This is in direct opposition to the purpose of group_interval, which is to throttle the rate of notifications from a group. You can always reduce group_interval if you want to know faster. Also, that the alerts have stopped firing does not mean that the issue has resolved.
I feel both of your statements misunderstand and mischaracterize the spirit of the request and the logic behind it. Please reconsider. Settings exist because of ideas/needs like the OP's and should be more seriously considered.
Zooming out of the send_resolved vs group_interval debate: if a problem condition has ended, it makes sense, and will save pain, to have that truth or likelihood supersede contrary outdated information, and the sooner this happens the better. What could be argued is how best to have accurate information win over inaccurate information as soon as possible. Outdated information indicating a continued error state is rendered inaccurate by the data indicating the condition has returned to normal; we want to limit the time inaccurate data is allowed to live and be acted upon. At this point send_resolved seems to be the only feature that represents the fact that Prometheus believes an error state has resolved itself. If an error state is resolved, yet that has not yet superseded the error state, previously issued alerts remain inaccurate and downstream systems (PagerDuty, automatic remediations, etc.) are operating on inaccurate information. The point here is to have accurate information prevail more quickly, and to shorten the window in which systems and people live in ignorance with inaccurate information. The debate of group_interval vs send_resolved, and whether waiting intervals are being respected, is a sideshow to the true topic here.
- To my knowledge, send_resolved is not triggered unless Prometheus deems the alert resolved based on the data it is scraping. As far as I know, this is completely different from the point you state, "alerts have stopped firing does not mean that the issue has resolved." send_resolved is generated because Prometheus has scrape data indicating the error-causing condition has ended, which is completely different from Prometheus simply not sending more alerts about the problem and that being perceived as the problem being resolved.
- send_resolved is 'special', a special-case, singular alert. I think the OP's request makes sense: allow the new knowledge that a condition has resolved to bypass group_interval, and have the send_resolved alert supersede all other alerts and, in turn, one's automated remediations. group_interval may assist with overly chatty alerts, but send_resolved is unlikely to be one of those or to contribute to the 'too many alerts' dynamic that group_interval mitigates.
send_resolved being in direct opposition to group_interval is the whole feature request here and the whole point.
** I'm assuming group_interval is the delay-wall that would apply to send_resolved, as I imagine send_resolved would be considered a subsequent alert to the previous alerts, so group_interval would apply. If I'm wrong and group_wait is actually the delay-wall for send_resolved, please re-interpret my statements to that effect. I think group_interval applies if send_resolved is considered a subsequent notification for the existing group rather than a brand new alert.
so where is the harm?
Spamming the oncall is to be avoided.
In conditions where an error has resolved itself, the on-call is likely to get more notifications and be 'more awake' for longer than necessary, because the alerts still show systems being down and downstream alert systems like PagerDuty will continue their alert cycles, since the send_resolved has to wait a full period to take effect. If the send_resolved could be issued sooner, it could negate alerts that would wake up the on-call under the current design.
I was woken up last night by a "Prometheus all targets missing" error that resolved itself by the time I was at the keyboard. In my desire to not have that happen again, I researched the send_resolved feature. The only solution I could think of was send_resolved being sent right away, so I researched how to make send_resolved take effect immediately, or at least faster, and found this thread.
No matter how short or long group_interval is, send_resolved will always be late to the party because it has to wait outside to get in.
Would including 'status' in the group_by field help with this? Would that treat "resolved" messages as being in a different group to "firing" messages, and thus avoid the grouping delay from the initial alert?
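Something along these lines is what I mean; note that group_by operates on label names, and I am not sure 'status' is actually available as a label on the alerts, so this may have no effect:

```yaml
route:
  receiver: 'oncall'
  # 'alertname' is a real label; 'status' only matters here if the alerts
  # actually carry a label with that name, which they do not by default
  group_by: ['alertname', 'status']
  group_wait: 30s
  group_interval: 5m
```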
@OrlandoArcapix That's exactly our situation. "Resolved" messages should get delivered faster; in our case within 5 minutes, not having to wait for the group_interval: 6h. Any update on this issue?
Hi, any update on this issue? Getting resolved messages immediately would be great.
Hi, any update on this issue?