Creating new metric for filtering 4xx errors
- Creating a new metric for filtering 4xx errors: numTotal4xxFailedNotifications (a sketch of how such a counter could be declared follows this list).
- Modifying the Notify function to also return a boolean value, is4xx. We will use this value to update the above metric in notify.go:
```go
// Proposed signature change (the second bool is the new is4xx return value):
func (n *Notifier) Notify(ctx context.Context, alert ...*types.Alert) (bool, bool, error) { }

// Call site in the retry stage:
retry, is4xx, err := r.integration.Notify(ctx, sent...)
```
- For this change we have to update the Notify function in all the integrations. For now, the change is masked by returning false for this value from all other integrations; it is only set when the notification comes from SNS (a hedged sketch of how SNS could classify a 4xx response follows the list of impacted files). We will update this boolean for the other integrations after discussion with the community.
- The is4xx variable is used to increment the metric inside the exec function of the retry stage. It identifies whether the returned error is a 4xx error or not:
```go
if is4xx {
	r.metrics.numTotal4xxFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
}
```
- Wrote a test, TestRetryStageWithErrorCode, in notify_test.go to verify that this metric is incremented.
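For context, here is a minimal sketch of how such a counter could be declared and registered with client_golang; the helper name and the exposed metric name are assumptions for illustration, not necessarily what this change would use:

```go
package notify

import "github.com/prometheus/client_golang/prometheus"

// newFailed4xxCounter sketches how the proposed numTotal4xxFailedNotifications
// counter could be declared and registered. The metric name, help text and
// label set here are assumptions for illustration.
func newFailed4xxCounter(reg prometheus.Registerer) *prometheus.CounterVec {
	c := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "alertmanager",
			Name:      "notifications_failed_4xx_total",
			Help:      "The total number of notifications that failed with a 4xx response, per integration.",
		},
		[]string{"integration"},
	)
	reg.MustRegister(c)
	return c
}
```

The retry-stage snippet above would then increment this counter with the integration name as its only label value.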
Files Impacted:
notify/email/email.go notify/email/email_test.go notify/notify.go notify/notify_test.go notify/opsgenie/opsgenie.go notify/pagerduty/pagerduty.go notify/pagerduty/pagerduty_test.go notify/pushover/pushover.go notify/slack/slack.go notify/sns/sns.go notify/telegram/telegram.go notify/test/test.go notify/victorops/victorops.go notify/victorops/victorops_test.go notify/webhook/webhook.go notify/wechat/wechat.go
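As a rough illustration of how the SNS notifier could decide the value of is4xx, the helper below inspects an aws-sdk-go request error; the helper name is hypothetical and the exact classification logic would need to match how sns.go already handles SDK errors:

```go
package sns

import "github.com/aws/aws-sdk-go/aws/awserr"

// is4xxError sketches how an AWS SDK error could be classified as a
// client-side (4xx) failure before returning is4xx from Notify.
func is4xxError(err error) bool {
	if reqErr, ok := err.(awserr.RequestFailure); ok {
		code := reqErr.StatusCode()
		return code >= 400 && code < 500
	}
	return false
}
```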
Ping @gotjosh @simonpasquier what do you think about this?
@roidelapluie @gotjosh @simonpasquier any comments on this?
Seems like I got tagged on this during some of my PTO and apparently I missed the tag - apologies about that.
I'm not sure this is the direction that I want to go. We have an alertmanager_notification_latency_seconds histogram that tracks the latency for notifications; could we not add the status codes to that histogram as labels instead?
Keeping in mind that this would increase the number of series produced by the Alertmanager, but perhaps there are some areas where we can compromise.
Off the top of my head:
- Don't expose a series for integrations that don't have any configuration - right now we expose a series for all integrations regardless of whether they're used or not.
- If we think that including status_code would be a cardinality issue, an alternative could be to use 2xx, 3xx, 4xx and 5xx as the status code to reduce the number of series (see the sketch after this list).
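To make that second point concrete, a small helper along these lines (the function name is mine, not existing code) could collapse exact status codes into their class before using them as a label value:

```go
package notify

import "fmt"

// statusCodeClass maps an HTTP status code to its class ("2xx", "3xx", "4xx",
// "5xx") so that the label only ever takes a handful of values.
func statusCodeClass(code int) string {
	if code < 100 || code >= 600 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}
```

With that, the histogram would gain at most a few extra series per integration rather than one per distinct status code.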
One last thing I'd like to get some clarity on: what exactly is the problem that you're trying to solve here? In my experience, notifications_failed_total has been enough to tell whether an integration is working correctly or not.
Hi Josh @gotjosh. Thanks a lot for your comments. The main problem we are trying to solve is distinguishing notification failures caused by customer input (for example, a permission-denied error from an integration, which is expected) from programmatic errors (programmatic errors plus 5xx errors from the integrations). To address the cardinality concern, we could:
- Publish a new metric for SNS integration 4xx errors
- Publish notifications_failed_total with 4xx and 5xx labels (a sketch of this option follows below)
Please let us know your preference.
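For the second option, a hedged sketch of what the counter declaration could look like (metric and label names are placeholders for discussion, not a final implementation):

```go
package notify

import "github.com/prometheus/client_golang/prometheus"

// numNotifyFailed sketches option 2: keep a single failed-notifications
// counter but add a status-class label so 4xx and 5xx failures can be told
// apart without one series per exact status code.
var numNotifyFailed = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "alertmanager",
		Name:      "notifications_failed_total",
		Help:      "The total number of failed notifications.",
	},
	[]string{"integration", "status_class"}, // status_class is e.g. "4xx" or "5xx"
)
```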
Discussed this during the Alertmanager Working Group - closing due to https://github.com/prometheus/alertmanager/pull/3094 being merged.