public-roadmap
Alert Enhancement: x failures in y minutes
Just wondering if you would consider implementing an alert rule for "x failures in y minutes"? My reason here is that our applications are load balanced between web servers, so it might fail once on a bad server, but the next check is OK because it hits a different server.
We'd like the option of an "X failures in Y minutes" rule so that we can filter out the random failures (alert fatigue) and only get alerted when there are frequent failures.
Thanks!
@rmsral thanks for contributing. If I understand you correctly, I think we already have that feature. Have a look at your alert settings.
Or let me know if this is not sufficient.
Good morning Tim. Unfortunately, the setting you mention resets the count whenever a check returns a healthy result. For instance, if 1 server in a farm of servers is bad, a check might hit that bad server 1 in every 3 times, so the count never builds up.
I'm hoping for a feature that alerts on, say, 5 failures within a 5-minute period, even if there are some healthy results in between (which is what you get with some healthy servers and 1 bad server in a load-balanced environment).
Hope that makes sense!
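To make the request concrete, here is a minimal sketch of an "X failures in Y minutes" rule as a sliding time window. It is only an illustration of the idea, not how the product implements or would implement alerting; the `FailureWindow` class, its `record` method, and the parameter names are all hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class FailureWindow:
    """Hypothetical rule: fire when at least `threshold` failures
    fall inside a sliding window of length `window`."""

    def __init__(self, threshold: int, window: timedelta) -> None:
        self.threshold = threshold
        self.window = window
        self._failures = deque()  # timestamps of failures still inside the window

    def record(self, timestamp: datetime, ok: bool) -> bool:
        """Record one check result; return True if the alert condition holds."""
        if not ok:
            self._failures.append(timestamp)
        # Drop failures that have aged out of the window. Healthy results
        # never reset the count; only the passage of time does.
        cutoff = timestamp - self.window
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.threshold


# One check per minute, alternating between a bad and a healthy server.
start = datetime(2024, 1, 1, 12, 0)
rule = FailureWindow(threshold=3, window=timedelta(minutes=5))
outcomes = [False, True, False, True, False]
fired = [rule.record(start + timedelta(minutes=i), ok) for i, ok in enumerate(outcomes)]
print(fired)  # [False, False, False, False, True]
```

A rule that counts consecutive failures would never get past 1 on this sequence, because every healthy result resets it. The windowed rule fires on the third failure, since all three fall within the same 5 minutes.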
Hi @rmsral, I understand the request now. At this point we do not have this feature, as you noticed, but I think the request is fair. In my mind, this feature would trigger some form of degraded state, like the one we have right now for API checks when things are slow but not broken. This case is similar: the service is degraded, or "flapping" to use the good old Nagios term.
I cannot promise we will have this feature soon, but I want to take a closer look at it to determine if we have the basic data (which I think we do) to trigger such a state.