
Feature request - Matcher continue on receiver failure

Open wiardvanrij opened this issue 2 years ago • 2 comments

Setup

We have 2 Slack receivers: r1 and r2.

We have routes with matchers like so:

- matchers:
  receiver: r2
  routes:
    - matchers:
        - namespace!=""
      receiver: r1
    - matchers:
      receiver: r2

(sorry for formatting, can't get it right, you'll get the idea no?)

So what this does is:

  • Have a main fallback to r2 in case nothing matches. In this example it should never be hit, because the last matcher always matches
  • Have a matcher if namespace is not empty which routes to r1
  • Have a second matcher which matches everything else (because continue=false by default) and goes to r2
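The three bullets above can be checked with a small simulation. This is only a sketch of first-match routing as described in this issue, not Alertmanager's actual code: the `Route` class, the `match_all` helper, and the `resolve` function are hypothetical names, and matchers are modelled as plain Python predicates.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Labels = Dict[str, str]


def match_all(labels: Labels) -> bool:
    """Stand-in for an empty matcher list, which matches every alert."""
    return True


@dataclass
class Route:
    receiver: str
    matches: Callable[[Labels], bool] = match_all
    continue_: bool = False  # models `continue`, which defaults to false
    routes: List["Route"] = field(default_factory=list)


def resolve(route: Route, labels: Labels) -> List[str]:
    """Return the receivers an alert with these labels would be sent to."""
    if not route.matches(labels):
        return []
    found: List[str] = []
    for child in route.routes:
        hit = resolve(child, labels)
        found.extend(hit)
        # Without `continue`, the first matching child ends the search.
        if hit and not child.continue_:
            break
    # If no child matched, the route's own receiver is the fallback.
    return found or [route.receiver]


tree = Route(
    receiver="r2",
    routes=[
        Route(receiver="r1", matches=lambda l: l.get("namespace", "") != ""),
        Route(receiver="r2"),
    ],
)

print(resolve(tree, {"namespace": "prod"}))
print(resolve(tree, {"alertname": "Watchdog"}))
```

An alert with a non-empty namespace stops at the first child (r1); anything else falls through to the catch-all child (r2), so the top-level receiver is never used on its own.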

Use-case

Now imagine having a more complicated setup with dynamic settings. For example dynamic Slack channels as such:

channel: some-channel-{{ (index .Alerts 0).Labels.namespace }}

The use-case here is to dynamically route alerts based on their labels. It's easy to check whether a namespace label is present and then route to a receiver with such a channel variable. The problem is that there is no way to know whether the channel exists and whether the alert actually gets sent.

This would result in an error like:

level=error ts=2022-06-16T18:51:31.451Z caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed" num_alerts=11 err="slack[0]: notify retry canceled due to unrecoverable error after 1 attempts: channel "****redacted****": unexpected status code 404: channel_not_found"
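To make the failure mode concrete, here is a rough Python equivalent of the channel template above. It is an illustration only: `render_channel` is a hypothetical helper, alerts are modelled as plain label dictionaries, and a missing label is assumed to render as an empty string.

```python
from typing import Dict, List


def render_channel(alerts: List[Dict[str, str]]) -> str:
    """Mimic `some-channel-{{ (index .Alerts 0).Labels.namespace }}`."""
    # Only the first alert of the group decides the channel name.
    first = alerts[0]
    return "some-channel-" + first.get("namespace", "")


group = [{"namespace": "payments"}, {"namespace": "checkout"}]
print(render_channel(group))
```

Nothing in this step knows whether the resulting channel exists in Slack. That only becomes visible when the send fails, as in the log line above.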

Feature request

- matchers:
  receiver: r2
  routes:
    - matchers:
        - namespace!=""
      receiver: r1
      continue_on_receiver_failure: true
    - matchers:
      receiver: r2

I would like an option such as continue_on_receiver_failure. It would behave like continue, but only trigger when sending to the receiver fails; evaluation would then continue with the other matchers. The current log line would no longer be level=error but level=info, and would only become level=error when no other matcher catches and sends the alert. For example, Alertmanager could keep the state and, after the full evaluation of the routes, know whether the alert was eventually sent somewhere.
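A minimal sketch of the requested behaviour, assuming the matching receivers are tried in route order. Everything here is hypothetical: continue_on_receiver_failure does not exist in Alertmanager, and `notify_with_fallback`, the `Sender` type, and the logger name are made up for illustration.

```python
import logging
from typing import Callable, Dict, List, Optional, Tuple

log = logging.getLogger("dispatcher-sketch")

Alert = Dict[str, str]
Sender = Callable[[Alert], None]  # assumed to raise on delivery failure


def notify_with_fallback(
    candidates: List[Tuple[str, Sender]], alert: Alert
) -> Optional[str]:
    """Try receivers in order; return the name of the first that succeeds."""
    failures: List[str] = []
    for name, send in candidates:
        try:
            send(alert)
        except Exception as err:
            # An intermediate failure is only informational while
            # another receiver may still deliver the alert.
            failures.append(f"{name}: {err}")
            log.info("receiver %s failed, continuing: %s", name, err)
            continue
        return name
    # Nothing delivered the alert: this is the real error case.
    log.error("Notify for alerts failed: %s", "; ".join(failures))
    return None
```

With r1 failing on channel_not_found and r2 healthy, the call returns "r2" and logs the r1 failure at info level. An error is logged only when every candidate fails.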

Other info

I also think this is a very valuable thing to have in general. Let's say some receiver endpoint like Slack is down. Then we can automatically fall back to another receiver endpoint, without needing to constantly send to both endpoints and introducing noise.

wiardvanrij, Jun 21 '22 19:06

After reading the use case, it seems to me it would make more sense to have multiple receivers for your slack channels, and use the routing tree to properly direct alerts to the correct receivers rather than using a single receiver and templating the channel setting. This effectively moves the routing logic into the receiver + template. Is there something that makes doing this with routes infeasible?

benridley, Aug 22 '22 00:08

Yea I think that's missing the use-cases;

  • Knowing if a receiver endpoint actually exists
  • Knowing if the receiver endpoint is actually healthy + able to send the alert to it (return status OK or not?)

wiardvanrij, Aug 22 '22 12:08