Creating new metric for filtering 4xx errors
- Creating a new metric for filtering 4xx errors: numTotal4xxFailedNotifications (a sketch of how such a counter could be declared follows this list).
- Modifying the Notify function to also return a boolean value, is4xx. We will use this value to update the above metric in notify.go:
```go
// Proposed signature change (the second bool is the new is4xx return value):
func (n *Notifier) Notify(ctx context.Context, alert ...*types.Alert) (bool, bool, error) { }

// Call site in the retry stage:
retry, is4xx, err := r.integration.Notify(ctx, sent...)
```
- For this change we have to update the Notify function in all the integrations. For now, the change is masked by returning false for this value from all other integrations; it is only set when the notification comes from SNS (a hedged sketch of how SNS could classify a 4xx response follows the list of impacted files). We will update this boolean for the other integrations after discussion with the community.
- The is4xx variable is used to increment the metric inside the exec function of the retry stage. It identifies whether the returned error is a 4xx error or not:
```go
if is4xx {
	r.metrics.numTotal4xxFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
}
```
- Wrote a test, TestRetryStageWithErrorCode, in notify_test.go to verify that this metric is incremented.
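For context, here is a minimal sketch of how such a counter could be declared and registered with client_golang; the helper name and the exposed metric name are assumptions for illustration, not necessarily what this change would use:

```go
package notify

import "github.com/prometheus/client_golang/prometheus"

// newFailed4xxCounter sketches how the proposed numTotal4xxFailedNotifications
// counter could be declared and registered. The metric name, help text and
// label set here are assumptions for illustration.
func newFailed4xxCounter(reg prometheus.Registerer) *prometheus.CounterVec {
	c := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "alertmanager",
			Name:      "notifications_failed_4xx_total",
			Help:      "The total number of notifications that failed with a 4xx response, per integration.",
		},
		[]string{"integration"},
	)
	reg.MustRegister(c)
	return c
}
```

The retry-stage snippet above would then increment this counter with the integration name as its only label value.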
Files Impacted:
notify/email/email.go notify/email/email_test.go notify/notify.go notify/notify_test.go notify/opsgenie/opsgenie.go notify/pagerduty/pagerduty.go notify/pagerduty/pagerduty_test.go notify/pushover/pushover.go notify/slack/slack.go notify/sns/sns.go notify/telegram/telegram.go notify/test/test.go notify/victorops/victorops.go notify/victorops/victorops_test.go notify/webhook/webhook.go notify/wechat/wechat.go
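As a rough illustration of how the SNS notifier could decide the value of is4xx, the helper below inspects an aws-sdk-go request error; the helper name is hypothetical and the exact classification logic would need to match how sns.go already handles SDK errors:

```go
package sns

import "github.com/aws/aws-sdk-go/aws/awserr"

// is4xxError sketches how an AWS SDK error could be classified as a
// client-side (4xx) failure before returning is4xx from Notify.
func is4xxError(err error) bool {
	if reqErr, ok := err.(awserr.RequestFailure); ok {
		code := reqErr.StatusCode()
		return code >= 400 && code < 500
	}
	return false
}
```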
Ping @gotjosh @simonpasquier what do you think about this?
@roidelapluie @gotjosh @simonpasquier any comments on this?
Seems like I got tagged on this during some of my PTO and apparently I missed the tag - apologies about that.
I'm not sure this is the direction that I want to go. We have an alertmanager_notification_latency_seconds histogram that tracks the latency for notifications; could we not add the status codes to that histogram as labels instead?
Keeping in mind that this would increase the number of series produced by the Alertmanager, but perhaps there are some areas where we can compromise.
Off the top of my head:
- Don't expose a series for integrations that don't have any configuration - right now we expose a series for all integrations regardless of whether they're used or not.
- If we think that including status_code would be a cardinality issue, an alternative could be to use 2xx, 3xx, 4xx and 5xx as the status code to reduce the number of series (see the sketch after this list).
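To make that second point concrete, a small helper along these lines (the function name is mine, not existing code) could collapse exact status codes into their class before using them as a label value:

```go
package notify

import "fmt"

// statusCodeClass maps an HTTP status code to its class ("2xx", "3xx", "4xx",
// "5xx") so that the label only ever takes a handful of values.
func statusCodeClass(code int) string {
	if code < 100 || code >= 600 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}
```

With that, the histogram would gain at most a few extra series per integration rather than one per distinct status code.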
One last thing I'd like to get some clarity on: what exactly is the problem that you're trying to solve here? In my experience, notifications_failed_total has been enough to tell whether an integration is working correctly or not.
Hi Josh @gotjosh. Thanks a lot for your comments. The main problem we are trying to solve is distinguishing notification failures caused by customer input (for example, a permission-denied error from an integration, which is expected) from programmatic errors (programmatic errors plus 5xx errors from the integrations). To address the cardinality concern, we could:
- Publish a new metric for SNS integration 4xx errors
- Publish notifications_failed_total with 4xx and 5xx labels (a sketch of this option follows below)
Please let us know your preference.
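For the second option, a hedged sketch of what the counter declaration could look like (metric and label names are placeholders for discussion, not a final implementation):

```go
package notify

import "github.com/prometheus/client_golang/prometheus"

// numNotifyFailed sketches option 2: keep a single failed-notifications
// counter but add a status-class label so 4xx and 5xx failures can be told
// apart without one series per exact status code.
var numNotifyFailed = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "alertmanager",
		Name:      "notifications_failed_total",
		Help:      "The total number of failed notifications.",
	},
	[]string{"integration", "status_class"}, // status_class is e.g. "4xx" or "5xx"
)
```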
Discussed this during the Alertmanager Working Group - closing due to https://github.com/prometheus/alertmanager/pull/3094 being merged.