Firing -> Resolved -> Firing issue with prometheus + alertmanager
Hello! I have been struggling with one problem since we started using Prometheus. I use blackbox_exporter to monitor the availability of a domain over HTTP, and Alertmanager to send notifications to Slack, Telegram and SMS. Occasionally the domain becomes unavailable over HTTP and Prometheus correctly sends the "firing" notifications, but at some point a "resolved" notification arrives even though the server is still unavailable, and some time later another "firing" notification comes again. The configuration is given below. Prometheus (part of the config):
global:
  scrape_interval: 10s
  scrape_timeout: 8s
  evaluation_interval: 1m
rule_files:
  - /etc/prometheus/production.rules
scrape_configs:
  - job_name: 'myserver.com_n10.myserver.com_ui'
    scrape_interval: 30s
    metrics_path: /probe
    params:
      module: [ myserver.com_n10.myserver.com_ui ]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
          - myserver.com
        labels:
          server: 'myserver.com'
          network: 'n10.myserver.com'
          domain: 'n10.myserver.com'
          type: 'http_ui'
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*)(:80)?
        target_label: __param_target
        replacement: http://${1}/login
      - source_labels: [__param_target]
        regex: (.*)
        target_label: instance
        replacement: ${1}
      - source_labels: []
        regex: .*
        target_label: __address__
        replacement: 127.0.0.1:9115  # Blackbox exporter.
blackbox.yml
myserver.com_n10.myserver.com_ui:
  prober: http
  timeout: 10s
  http:
    valid_status_codes: [ 200, 302, 301 ]  # Defaults to 2xx
    method: GET
    headers:
      Host: someheader.com
    no_follow_redirects: true
    fail_if_ssl: false
    tls_config:
      insecure_skip_verify: false
    protocol: "tcp"
    preferred_ip_protocol: "ip4"
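For reference, the module can also be exercised directly against the blackbox exporter to see what probe_success it would return, independently of Prometheus (a sketch; the target URL is the one produced by the relabel_configs above):
~]# curl 'http://127.0.0.1:9115/probe?module=myserver.com_n10.myserver.com_ui&target=http://myserver.com/login'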
alertmanager.yml
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp9.myserver.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  resolve_timeout: 1h
route:
  group_by: ['alertname', 'instance', 'type']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  receiver: slack_general
  routes:
    - match:
        severity: sms
      receiver: sms_admins
      continue: true
    - match:
        severity: warning
      receiver: slack_general
      repeat_interval: 1h
      continue: true
    - match:
        severity: warning
      receiver: admins
      repeat_interval: 1h
    - match_re:
        severity: critical|sms
      receiver: slack_general
      continue: true
    - match_re:
        severity: critical|sms
      receiver: admins
receivers:
  - name: slack_general
    slack_configs:
      - api_url: https://hooks.slack.com/services/SLACK_URI
        channel: '#notifications'
        send_resolved: true
        title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] '
        text: '{{ range $val := .Alerts }}
          Alert: <http://myserver.com:9095/#/alerts|{{ index $val.Annotations "description" }}>
          {{ end}}'
  - name: admins
    webhook_configs:
      - send_resolved: True
        url: http://127.0.0.1:9088/alert/telegram_id1,telegram_id2
templates:
  - '/etc/prometheus/alertmanager/templates/default.tmpl'
# test for rule inhibition, used for free space trigger for now
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance', 'mountpoint']
alert.rules
ALERT HTTPProblemOnUI
  IF probe_success{job="myserver.com_n10.myserver.com_ui"} != 1
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "HTTP check of myserver.com on server myserver.com",
    description = "HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com",
  }
Part of the Alertmanager journal (alert HTTPProblemOnUI[4f4a71c][active]):
Aug 17 20:14:38 myserver.com alertmanager[13966]: time="2017-08-17T20:14:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Aug 17 20:15:08 myserver.com alertmanager[13966]: time="2017-08-17T20:15:08Z" level=debug msg="flushing [HTTPProblemOnUI[ff7e520][active] HTTPProblemOnUI[31bfcf0][active] HTTPProblemOnUI[7eb9f05][active] HTTPProblemOnUI[abcb7d9][active] HTTPProblemOnUI[e3c0fa7][active] HTTPProblemOnUI[c7c16dc][active] HTTPProblemOnUI[4215379][active] HTTPProblemOnUI[c158c7c][active] HTTPProblemOnUI[f5691c0][active] HTTPProblemOnUI[4f4a71c][active] HTTPProblemOnUI[ee6848c][active] HTTPProblemOnUI[8c65933][active] HTTPProblemOnUI[82dcc81][active] HTTPProblemOnUI[7da68da][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="HTTPProblemOnUI"} source="dispatch.go:426"
Aug 17 20:15:08 myserver.com alertmanager[13966]: time="2017-08-17T20:15:08Z" level=debug msg="flushing [HTTPProblemOnUI[ff7e520][active] HTTPProblemOnUI[c7c16dc][active] HTTPProblemOnUI[f5691c0][active] HTTPProblemOnUI[8c65933][active] HTTPProblemOnUI[4215379][active] HTTPProblemOnUI[7eb9f05][active] HTTPProblemOnUI[abcb7d9][active] HTTPProblemOnUI[e3c0fa7][active] HTTPProblemOnUI[c158c7c][active] HTTPProblemOnUI[82dcc81][active] HTTPProblemOnUI[4f4a71c][active] HTTPProblemOnUI[ee6848c][active] HTTPProblemOnUI[31bfcf0][active] HTTPProblemOnUI[7da68da][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="HTTPProblemOnUI"} source="dispatch.go:426"
Aug 17 20:15:38 myserver.com alertmanager[13966]: time="2017-08-17T20:15:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Aug 17 20:16:38 myserver.com alertmanager[13966]: time="2017-08-17T20:16:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Here it went into the "resolved" state (even though the query shows that the check still returned probe_success 0 at this point; I'll show it below):
...
Aug 17 20:33:41 myserver.com alertmanager[13966]: time="2017-08-17T20:33:41Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Aug 17 20:34:38 myserver.com alertmanager[13966]: time="2017-08-17T20:34:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][resolved] component=dispatcher source="dispatch.go:184"
...
Then it went back to "active":
Aug 17 20:48:38 myserver.com alertmanager[13966]: time="2017-08-17T20:48:38Z" level=debug msg="Received alert" alert=HTTPProblemOnUI[4f4a71c][active] component=dispatcher source="dispatch.go:184"
Here is the data from Prometheus: http://joxi.net/Dr81e9eikPQDLm http://joxi.net/1A51eJeiK1g1GA
It all started at:
~]# date -d @1503000569.746
Thu Aug 17 20:09:29 UTC 2017
and ended at:
~]# date -d @1503002909.746
Thu Aug 17 20:48:29 UTC 2017
This is the data that Prometheus sent to Alertmanager during this interval:
[{"labels":{"alertname":"HTTPProblemOnUI","domain":"n10.myserver.com","instance":"http://myserver.com/login","job":"myserver.com_n10.myserver.com_ui","network":"n10.myserver.com","server":"myserver.com","severity":"critical","type":"http_ui"},"annotations":{"description":"HTTP check of myserver.com on server myserver.com","summary":"HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com"},"startsAt":"2017-08-17T20:14:38.52Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://myserver.com:9090/graph"}]
Then, for some reason, an alert with "endsAt" set arrived, and accordingly a "resolved" notification was sent:
[{"labels":{"alertname":"HTTPProblemOnUI","domain":"n10.myserver.com","instance":"http://myserver.com/login","job":"myserver.com_n10.myserver.com_ui","network":"n10.myserver.com","server":"myserver.com","severity":"critical","type":"http_ui"},"annotations":{"description":"HTTP check of myserver.com on server myserver.com","summary":"HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com"},"startsAt":"2017-08-17T20:14:38.52Z","endsAt":"2017-08-17T20:34:38.52Z","generatorURL":"http://myserver.com:9090/graph"}]
And then the notification message with the status "firing" came again:
[{"labels":{"alertname":"HTTPProblemOnUI","domain":"n10.myserver.com","instance":"http://myserver.com/login","job":"myserver.com_n10.myserver.com_ui","network":"n10.myserver.com","server":"myserver.com","severity":"critical","type":"http_ui"},"annotations":{"description":"HTTP check of myserver.com on server myserver.com","summary":"HTTP check of myserver.com (UI domain on myserver.com) on server s43.myserver.com"},"startsAt":"2017-08-17T20:47:38.52Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://myserver.com:9090/graph"}]
If any additional information is needed, I am happy to provide it. One note: we use many alerts with the same name (for grouping of messages); I have shown only one alert type (HTTPProblemOnUI). These problems happen not only with alerts of this group. Thank you! P.S. Sorry for my poor English :)
I forgot to specify the software versions:
~]# prometheus -version
prometheus, version 1.6.1 (branch: master, revision: 4666df502c0e239ed4aa1d80abbbfb54f61b23c3)
build user: root@ca1b452514b7
build date: 20170419-13:26:24
go version: go1.8.1
~]# alertmanager -version
alertmanager, version 0.8.0 (branch: HEAD, revision: 74e7e48d24bddd2e2a80c7840af9b2de271cc74c)
build user: root@439065dc2905
build date: 20170720-14:14:06
go version: go1.8.3
@artemlive First of all thank you very much for this very detailed bug report!
Before digging further into this, two questions:
- Is there a particular reason why you specify such a small repeat_interval in your Alertmanager configuration?
- Can you confirm that the target (myserver.com) was down the entire time between the first firing alert, the resolved notification and the second firing alert?
Hello!
- These services are very critical for us, so we decided to set such a short repeat_interval. (We used the same interval on the monitoring system from which we are migrating to Prometheus + Alertmanager; most likely we will increase it later.)
- Yes, I am sure. I posted the result of a Prometheus query above; during this period the service was down.
Here is the data from Prometheus: http://joxi.net/Dr81e9eikPQDLm http://joxi.net/1A51eJeiK1g1GA
It all started at:
~]# date -d @1503000569.746
Thu Aug 17 20:09:29 UTC 2017
and ended at:
~]# date -d @1503002909.746
Thu Aug 17 20:48:29 UTC 2017
Many thanks for your response!
This is quite confusing and sounds like a bug. Are you able to reproduce this with a minimal config? It is difficult for me to reconstruct your exact environment for debugging.
I'll try to reproduce it today or tomorrow. Thanks!
I have this problem too. My alertmanager version is 0.5.1 and my config is:
global:
  smtp_smarthost: 'localhost:25'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  hipchat_auth_token: '1234556789'
  hipchat_url: 'https://hipchat.foobar.org/'
templates:
  - '/etc/alertmanager/template/*.tmpl'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1m
  receiver: webhook
  routes:
    - match_re:
        service: ^(foo1|foo2|baz)$
      receiver: webhook
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: http://alert-collector:17130/alert/v1.0/alerts
We are currently on v0.11.0. I recommend you update.
We have upgraded to v0.11.0 but still have this problem. @stuartnelson3
If you create several alerts with the same name, the problem is reproducible. I added another debug receiver in Alertmanager (https://github.com/prometheus/alertmanager/blob/master/examples/webhook/echo.go) and it seems I figured out what the problem is. In my case it looked like this:
"receiver": "debug", "status": "resolved", "alerts": [ { "status": "resolved", "labels": { "alertname": "TCP_CHECK", "instance": "127.0.0.1:9115", "job": "tcp_service_rmi", "severity": "sms" }, "annotations": { "description": "RMI port 8080 on myserver is unreachable", "summary": "RMI port 8080 on myserver is unreachable " }, "startsAt": "2018-01-12T13:10:37.803Z", "endsAt": "2018-01-12T13:42:37.804Z", "generatorURL": "http://prometheus:9090/graph?g0.expr=probe_myserver2_8080_success%7Bjob%3D%22tcp_service_rmi%22%7D+%21%3D+1\u0026g0.tab=0" } ],
The generatorURL comes from another service that shares the same alert name. I gave each alert a unique name and now the problem no longer reproduces.
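In case it is useful to someone, the debug receiver was roughly this addition to alertmanager.yml (a sketch; the port is just what the examples/webhook/echo.go listener uses in my setup, and the catch-all child route is only for debugging):
route:
  routes:
    # copy every alert to the debug receiver, then keep evaluating the other routes
    - receiver: debug
      match_re:
        severity: '.*'
      continue: true
receivers:
  - name: debug
    webhook_configs:
      # assumed address of the echo.go example webhook
      - url: http://127.0.0.1:5001/
        send_resolved: true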
Some kind of magic... Here is an alert defined in Prometheus:
- alert: SS_CONFIGURATIONS_ADS
  expr: probe_ss_config{job=~".*_ui_health"} != 1
  for: 5m
  labels:
    severity: critical
  annotations:
    description: SS_CONFIGURATIONS service {{ $labels.name }} on {{ $labels.network }}) on server {{ $labels.server }}
    summary: configs {{ $labels.name }}{{ $labels.network }}
There is a metric named probe_ss_config. Here is the result of a query that sums all checks over 2h; it equals 0, which means the service was down (every check in those 2h returned 0). Vector of data for 2h (all checks equal to 0): http://joxi.net/n2YB707ijzlYEA And the sum of all elements equals 0: http://joxi.net/8AnWDNDTqJpqor (the query is sketched below, after the version info). But sometimes Prometheus sends a "resolved" status for the alert:
"receiver": "debug", "status": "resolved", "alerts": [ { "status": "resolved", "labels": { "alertname": "SS_CONFIGURATIONS_ADS", "domain": "xxxx.myserver.com", "instance": "http://ns7.myserver.com", "job": "ns7.myserver.com_xxxx.myserver.com_ui_health", "name": "AnalyticsLogConfig", "network": "xxxx.myserver.com", "server": "ns7.myserver.com", "service": "ss_configurations", "severity": "critical", "type": "health_check" }, "annotations": { "description": "SS_CONFIGURATIONS service AnalyticsLogConfig on xxxx.myserver.com) on server ns7.myserver.com", "summary": "configs AnalyticsLogConfigxxxx.myserver.com" }, "startsAt": "2018-01-23T14:32:05.524923261Z", "endsAt": "2018-01-23T14:33:05.524853742Z", "generatorURL": "http://prometheus-high-cpu:9090/graph?g0.expr=probe_ss_config%7Bjob%3D~%22.%2A_ui_health%22%7D+%21%3D+1\u0026g0.tab=1" } ], "groupLabels": { "alertname": "SS_CONFIGURATIONS_ADS" }, "commonLabels": { "alertname": "SS_CONFIGURATIONS_ADS", "domain": "xxxx.myserver.com", "instance": "http://ns7.myserver.com", "job": "ns7.myserver.com_xxxx.myserver.com_ui_health", "name": "AnalyticsLogConfig", "network": "xxxx.myserver.com", "server": "ns7.myserver.com", "service": "ss_configurations", "severity": "critical", "type": "health_check" }, "commonAnnotations": { "description": "SS_CONFIGURATIONS service AnalyticsLogConfig on xxxx.myserver.com) on server ns7.myserver.com", "summary": "configs AnalyticsLogConfigxxxx.myserver.com" }, "externalURL": "http://prometheus.myserver.com:9095", "version": "4", "groupKey": "{}/{severity=~"^(?:critical|sms)$"}:{alertname="SS_CONFIGURATIONS_ADS"}" }
But probe_ss_config equals 0, as I showed in the screenshots above.
prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98628a0463dddc90528220c94de5032d1a0)
build user: root@615b82cb36b6
build date: 20171108-07:11:59
go version: go1.9.2
alertmanager, version 0.13.0 (branch: HEAD, revision: fb713f6d8239b57c646cae30f78e8b4b8861a1aa)
build user: root@d83981af1d3d
build date: 20180112-10:32:46
go version: go1.9.2
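The 2h query behind the screenshots above was essentially the following (a sketch; the exact label matchers are abbreviated):
sum_over_time(probe_ss_config{job=~".*_ui_health"}[2h]) == 0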
@artemlive do you still see the issue with the latest AlertManager release?
@simonpasquier I can confirm this is not resolved with the latest release.
@justinjk007 you would have to open another issue to provide your specific details.
@artemlive I've read your description again and it doesn't seem related to Alertmanager, since Alertmanager acts properly on the information it receives from Prometheus. If anything, it is a problem with the alerting engine, though I'm not convinced that Prometheus computes the alert's status wrongly (it might simply be that you've got a failed scrape which resets the alert). To alleviate flapping, you could use an aggregation-over-time function like this: min_over_time(probe_ss_config{job=~".*_ui_health"}[5m]) != 1.
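In rule-file form (Prometheus 2.x rule syntax), that suggestion would look roughly like this (a sketch; the group name and for: duration are illustrative):
groups:
  - name: ui_health
    rules:
      - alert: SS_CONFIGURATIONS_ADS
        # the 5m range keeps the series present across a single missed scrape,
        # and min_over_time stays at 0 while any probe in the window failed
        expr: min_over_time(probe_ss_config{job=~".*_ui_health"}[5m]) != 1
        for: 5m
        labels:
          severity: critical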
@simonpasquier opened #1546
I too have what I think may be the same issue. Prometheus continues to report the alert in the active state to Alertmanager; however, Alertmanager reports the alert as resolved and then re-creates it minutes later.
The Alertmanager logs at debug level (ts and aggrGroup removed to reduce noise):
Feb 25 19:55:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 19:57:29 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 19:58:56 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 19:58:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 19:58:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 20:01:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][resolved]]
Feb 25 20:01:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][resolved]]
Feb 25 20:01:57 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:03:26 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:04:56 host alertmanager[16012]: level=debug caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:04:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 20:04:56 host alertmanager[16012]: level=debug caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
The alert state goes from active to resolved and back again while Prometheus itself continues to report the alert as active. There is a corresponding gap in the 'Received alert' messages from Prometheus, caused by the Prometheus configuration being reloaded, which happens frequently in our environment.
The gap in messages is less than the resolve_timeout.
AlertManager Details:
GoVersion: go1.10.3
Revision: 462c969d85cf1a473587754d55e4a3c4a2abc63c
Version: 0.15.0
Config Summary:
global:
  resolve_timeout: 10m
  http_config: {}
route:
  receiver: ops_team
  group_by:
    - alertname
    - site_code
    - cluster
    - instance
    - instance_id
    - resource_id
    - backend
    - service
    - interface_id
  routes:
    - receiver: ops
      match:
        team: ops_team
      continue: true
  group_wait: 0s
  group_interval: 3m
  repeat_interval: 2h
I will update to AlertManager 0.16.1 and see if it helps, however I don't see anything related to this in the Changelog.
Further investigation shows that, for us, the issue was introduced in Prometheus 2.4, #4550.
Prometheus now sends the endsAt field to Alertmanager with a very short timeout, after which Alertmanager marks the alert as resolved. This overrides the resolve_timeout setting in Alertmanager and creates the firing -> resolved -> firing behaviour if Prometheus does not resend the alert before that short timeout.
We will try increasing the resend delay (rules.alert.resend-delay) on Prometheus to 2m and see if there are any side effects.
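For reference, that is the --rules.alert.resend-delay command-line flag; this is roughly how we set it (a sketch; the config path is ours):
~]# prometheus \
      --config.file=/etc/prometheus/prometheus.yml \
      --rules.alert.resend-delay=2m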
I will update to AlertManager 0.16.1 and see if it helps, however I don't see anything related to this in the Changelog.
@steve-exley Yes, please try anything above and including 0.15.3 as those include https://github.com/prometheus/alertmanager/pull/1611.
We still had some of these issues on 0.16.1. The underlying cause is that the endsAt period generated by Prometheus does not leave much margin for error, and Alertmanager resolves the alert if it doesn't receive the next update from Prometheus in time.
I think changing the Prometheus rules.alert.resend-delay to 2m will resolve this in our scenario, because the endsAt value is calculated from it. It has been less than 24 hours since we made the change, so it is too early to be sure.
I also raised a Proposal on Prometheus #5277 to discuss whether the endsAt time period is too short.
The generatorURL comes from another service that shares the same alert name. I gave each alert a unique name and now the problem no longer reproduces.
I can confirm the same behavior on:
prometheus, version 2.3.2+ds (branch: debian/sid, revision: 2.3.2+ds-1~bpo9+1)
alertmanager, version 0.15.3+ds (branch: debian/sid, revision: 0.15.3+ds-2~bpo9+1)
Unique alert names fix the issue.
+1
mark
same problem:
alertmanager version: v0.18.0
prometheus version: v2.17.1
level=debug ts=2020-07-29T09:11:23.373Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[67399ab][active]
level=debug ts=2020-07-29T09:11:23.373Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[1aa03ad][active]
level=debug ts=2020-07-29T09:11:23.373Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[cadfe5a][resolved]
level=debug ts=2020-07-29T09:11:28.278Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[61b70c9][resolved]
level=debug ts=2020-07-29T09:11:28.278Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[9ff2c20][active]
level=debug ts=2020-07-29T09:11:33.321Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[18f376c][active]
level=debug ts=2020-07-29T09:11:38.304Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[fbd7908][resolved]
level=debug ts=2020-07-29T09:11:38.304Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[9ff2c20][resolved]
level=debug ts=2020-07-29T09:11:43.321Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[faf133d][active]
level=debug ts=2020-07-29T09:11:43.322Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=node-bp-code2837-net_recv[9d7d714][resolved]
I think I faced something similar in my #2406.