alertmanager
Feature request - Matcher continue on receiver failure
Setup
We have two Slack receivers: r1 and r2.
We have routes with matchers like so:
- matchers:
  receiver: r2
  routes:
    - matchers:
        - namespace!=""
      receiver: r1
    - matchers:
      receiver: r2
(sorry for formatting, can't get it right, you'll get the idea no?)
So what this does is:
- Have a main fallback (the top-level receiver, r2 in the snippet above) in case nothing matches. In this example it should never be hit, because the last matcher always matches.
- Have a matcher that checks whether namespace is not empty and routes to r1.
- Have a second matcher that matches everything else (because continue=false by default) and goes to r2.
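The first-match behaviour described above can be sketched in a few lines of Python. This is a simplified model, not Alertmanager's code: the `Route` class, the `resolve` function and the label values are illustrative names, and matchers are reduced to plain predicates.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Labels = Dict[str, str]


@dataclass
class Route:
    receiver: str
    # A route without matchers matches every alert.
    matches: Callable[[Labels], bool] = lambda labels: True
    continue_: bool = False  # mirrors `continue`, false by default
    routes: List["Route"] = field(default_factory=list)


def resolve(route: Route, labels: Labels) -> List[str]:
    """Return the receivers an alert ends up at (simplified model)."""
    if not route.matches(labels):
        return []
    found: List[str] = []
    for child in route.routes:
        hit = resolve(child, labels)
        found.extend(hit)
        # With continue=false, the first matching sibling wins.
        if hit and not child.continue_:
            break
    # No child matched: fall back to this route's own receiver.
    return found or [route.receiver]


tree = Route(
    receiver="r2",
    routes=[
        Route(receiver="r1", matches=lambda labels: labels.get("namespace", "") != ""),
        Route(receiver="r2"),
    ],
)

print(resolve(tree, {"namespace": "payments"}))  # ['r1']
print(resolve(tree, {"alertname": "NodeDown"}))  # ['r2']
```

Note that the decision is made purely on labels: whether r1 can actually deliver the notification plays no role at this stage.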
Use-case
Now imagine having a more complicated setup with dynamic settings. For example dynamic Slack channels as such:
channel: some-channel-{{ (index .Alerts 0).Labels.namespace }}
The use-case here is to route alerts dynamically based on their labels. It's easy to check whether a namespace is present and then route to a receiver with such a channel variable. The problem is that there is no way to know whether the channel exists and whether the alert actually gets sent.
This would result in an error like:
level=error ts=2022-06-16T18:51:31.451Z caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed" num_alerts=11 err="slack[0]: notify retry canceled due to unrecoverable error after 1 attempts: channel "****redacted****": unexpected status code 404: channel_not_found"
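To make the failure mode concrete, here is a rough Python imitation of what the channel template above computes. The function name, the sample alert and the set of existing channels are all made up for illustration; defaulting a missing label to an empty string is also an assumption.

```python
from typing import Dict, List, Set


def render_channel(alerts: List[Dict[str, Dict[str, str]]]) -> str:
    """Imitate `some-channel-{{ (index .Alerts 0).Labels.namespace }}`."""
    # Only the first alert in the group decides the channel name.
    namespace = alerts[0]["labels"].get("namespace", "")
    return f"some-channel-{namespace}"


# Hypothetical list of channels that exist in the Slack workspace.
existing_channels: Set[str] = {"some-channel-payments"}

group = [{"labels": {"alertname": "PodCrashLooping", "namespace": "checkout"}}]
channel = render_channel(group)
print(channel, channel in existing_channels)  # some-channel-checkout False
```

The route matched because the namespace label is non-empty, but nothing in the routing step can tell that the rendered channel does not exist.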
Feature request
- matchers:
  receiver: r2
  routes:
    - matchers:
        - namespace!=""
      receiver: r1
      continue_on_receiver_failure: true
    - matchers:
      receiver: r2
The request is for an option like continue_on_receiver_failure. It would behave like continue, but would only trigger when sending to the receiver fails, after which evaluation continues with the remaining matchers.
The current log line would no longer be level=error but level=info, and would only become level=error if no other matcher catches and sends the alert. For example, Alertmanager could keep state and, after fully evaluating the routes, know whether the alert was eventually sent somewhere.
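A minimal Python sketch of the requested semantics might look like this. Everything here is hypothetical: `continue_on_receiver_failure` is the option proposed in this issue and does not exist in Alertmanager, and the sketch merges routing and sending into one loop purely to keep the idea short.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Dict, List

Labels = Dict[str, str]
log = logging.getLogger("dispatch-sketch")


@dataclass
class Route:
    receiver: str
    matches: Callable[[Labels], bool] = lambda labels: True
    # Proposed option from this request; not an existing Alertmanager field.
    continue_on_receiver_failure: bool = False


def dispatch(siblings: List[Route], labels: Labels,
             send: Callable[[str, Labels], None]) -> List[str]:
    """Try sibling routes in order; return the receivers that accepted the alert."""
    delivered: List[str] = []
    failures: List[str] = []
    for route in siblings:
        if not route.matches(labels):
            continue
        try:
            send(route.receiver, labels)
        except Exception as err:
            failures.append(f"{route.receiver}: {err}")
            if route.continue_on_receiver_failure:
                # Downgraded to info: another route may still deliver.
                log.info("notify failed, trying next route: %s", failures[-1])
                continue
            break
        delivered.append(route.receiver)
        break  # usual continue=false behaviour: first successful match wins
    if failures and not delivered:
        # Only now is it known that the alert was sent nowhere.
        log.error("Notify for alerts failed: %s", "; ".join(failures))
    return delivered


def fake_send(receiver: str, labels: Labels) -> None:
    # Stand-in for a Slack call; r1's channel is assumed not to exist.
    if receiver == "r1":
        raise RuntimeError("unexpected status code 404: channel_not_found")


routes = [
    Route("r1", matches=lambda labels: labels.get("namespace", "") != "",
          continue_on_receiver_failure=True),
    Route("r2"),
]
print(dispatch(routes, {"namespace": "payments"}, fake_send))  # ['r2']
```

With the option set to false on the first route, the same call returns an empty list and logs the error, which corresponds to today's behaviour.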
Other info
I also think this is a very valuable thing to have in general. Say some receiver endpoint like Slack is down: we could then automatically fall back to another receiver endpoint, without needing to constantly send to both endpoints and introduce noise.
After reading the use case, it seems to me it would make more sense to have multiple receivers for your Slack channels, and use the routing tree to properly direct alerts to the correct receivers rather than using a single receiver and templating the channel setting. This effectively moves the routing logic into the receiver + template. Is there something that makes doing this with routes infeasible?
Yeah, I think that misses the use-cases:
- Knowing if a receiver endpoint actually exists
- Knowing if the receiver endpoint is actually healthy + able to send the alert to it (return status OK or not?)