alertmanager

Alert not sent

Open roidelapluie opened this issue 6 years ago • 6 comments

What did you do?

We have an alert with 3 webhook receivers.

The alert was sent yesterday evening, then again today.

But the alerts of today were not sent to the webhook receivers.

What did you expect to see?

Calls to webhooks yesterday and today.

What did you see instead? Under which circumstances?

The webhooks were only called yesterday.

Environment

  • Alertmanager version:

    0.18.0

  • Prometheus version:

    2.11 and 2.12

Deserialization of NFLOG during the second incident (where we did not receive notifications):

Entry: {}/{recipient=~"^(?:(.*,)?appteam/ticket(,.*)?)$"}/{repeat_interval=""}:{alertid="XXX-00014", customer_name="XXX", env="prod", hostname="XXX-mtprd01", priority="P1", recipient="appteam/ticket,XXX/circuit", title="JVM Down"}:appteam/ticket/webhook/0
{
  "entry": {
    "groupKey": "XXX",
    "receiver": {
      "groupName": "appteam/ticket",
      "integration": "webhook"
    },
    "timestamp": "2019-09-24T15:53:55.222089917Z",
    "firingAlerts": [
      "8684883655238988612"
    ]
  },
  "expiresAt": "2019-09-29T15:53:55.222089917Z"
}

Deserialization AFTER today's event is resolved:

Entry: {}/{recipient=~"^(?:(.*,)?appteam/ticket(,.*)?)$"}/{repeat_interval=""}:{alertid="XXX-00014", customer_name="XXX", env="prod", hostname="XXX-mtprd01", priority="P1", recipient="appteam/ticket,XXX/circuit", title="JVM Down"}:appteam/ticket/webhook/0
{
  "entry": {
    "groupKey": "XXX",
    "receiver": {
      "groupName": "appteam/ticket",
      "integration": "webhook"
    },
    "timestamp": "2019-09-25T15:20:55.074281100Z",
    "resolvedAlerts": [
      "8684883655238988612"
    ]
  },
  "expiresAt": "2019-09-30T15:20:55.074281100Z"
}

The timestamp of the first entry is the BEGINNING of the first event. The timestamp of the second entry is the END of the second event.

I would have expected the first one to be the END of the first event.
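The five-day span between `timestamp` and `expiresAt` in both entries matches Alertmanager's default notification-log retention (`--data.retention=120h`). A quick sanity check of that arithmetic (a standalone sketch; the timestamp values are copied from the entries above, trimmed to microseconds because Go emits nanosecond precision):

```python
from datetime import datetime, timezone

def parse_ts(ts: str) -> datetime:
    # Trim Go's nanosecond fraction to the microseconds strptime can parse.
    date_part, frac = ts.rstrip("Z").split(".")
    dt = datetime.strptime(date_part, "%Y-%m-%dT%H:%M:%S")
    return dt.replace(microsecond=int(frac[:6]), tzinfo=timezone.utc)

entry_ts = parse_ts("2019-09-24T15:53:55.222089917Z")
expires = parse_ts("2019-09-29T15:53:55.222089917Z")
print((expires - entry_ts).total_seconds() / 3600)  # 120.0 hours = 5 days
```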

roidelapluie avatar Sep 25 '19 20:09 roidelapluie

Picture of the first incident in Prometheus (pending, then firing):

roidelapluie avatar Sep 25 '19 20:09 roidelapluie

From a backup taken between the incidents (24th, 20:00), we see that the alert is still there (but it was no longer firing in Prometheus):

Entry: {}/{recipient=~"^(?:(.*,)?appteam/ticket(,.*)?)$"}/{repeat_interval=""}:{alertid="XXX-00014", customer_name="XXX", env="prod", hostname="XXX-mtprd01", priority="P1", recipient="appteam/ticket,XXX/circuit", title="JVM Down"}:appteam/ticket/webhook/0
{
  "entry": {
    "groupKey": "XXX",
    "receiver": {
      "groupName": "appteam/ticket",
      "integration": "webhook"
    },
    "timestamp": "2019-09-24T15:53:55.222089917Z",
    "firingAlerts": [
      "8684883655238988612"
    ]
  },
  "expiresAt": "2019-09-29T15:53:55.222089917Z"
}

roidelapluie avatar Sep 26 '19 08:09 roidelapluie

For anyone interested:

https://github.com/roidelapluie/nflogerror_exporter

roidelapluie avatar Sep 26 '19 13:09 roidelapluie

My test has shown that I have another alert in this state.

My Prometheus config used for the test:

- job_name: nflogerror
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /etc/prometheus/prometheus.d/nflogerror_exporter_*.yml
    refresh_interval: 5m
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: ALERTS_IN_NFLOG_NOT_FIRING_([0-9]+)_([0-9]+)_.*
    target_label: GROUPID
    replacement: $1
    action: replace
  - source_labels: [__name__]
    separator: ;
    regex: ALERTS_IN_NFLOG_NOT_FIRING_([0-9]+)_([0-9]+)_.*
    target_label: ALERTID
    replacement: $2
    action: replace
  - source_labels: [__name__]
    separator: ;
    regex: (ALERTS_IN_NFLOG_NOT_FIRING)_[0-9]+_[0-9]+_(.*)
    target_label: __name__
    replacement: ${1}_$2
    action: replace
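The three relabel rules above can be exercised outside Prometheus to see what they do to a scraped series. A minimal sketch; the metric name below is a made-up example that follows the `ALERTS_IN_NFLOG_NOT_FIRING_<groupid>_<alertid>_<suffix>` pattern the regexes expect (Prometheus anchors relabel regexes, hence `fullmatch`):

```python
import re

name = "ALERTS_IN_NFLOG_NOT_FIRING_12345_67890_timestamp_seconds"

# Rules 1 and 2: extract the group and alert IDs into labels.
m = re.fullmatch(r"ALERTS_IN_NFLOG_NOT_FIRING_([0-9]+)_([0-9]+)_.*", name)
labels = {"GROUPID": m.group(1), "ALERTID": m.group(2)}

# Rule 3: collapse the per-entry metric name back to a generic one.
new_name = re.sub(r"(ALERTS_IN_NFLOG_NOT_FIRING)_[0-9]+_[0-9]+_(.*)",
                  r"\1_\2", name)
print(labels, new_name)
# {'GROUPID': '12345', 'ALERTID': '67890'} ALERTS_IN_NFLOG_NOT_FIRING_timestamp_seconds
```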

My PromQL query:

time() - ALERTS_IN_NFLOG_NOT_FIRING_timestamp_seconds > 86400
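The query flags any notification-log entry whose recorded timestamp is more than 24 hours (86400 seconds) old while the alert is no longer firing. The same comparison as a standalone sketch (the function name is illustrative, not part of the exporter):

```python
import time

ONE_DAY = 86_400  # seconds, the threshold used in the PromQL query

def entry_is_stale(entry_timestamp_seconds, now=None):
    # Mirrors: time() - ALERTS_IN_NFLOG_NOT_FIRING_timestamp_seconds > 86400
    if now is None:
        now = time.time()
    return now - entry_timestamp_seconds > ONE_DAY

# An entry written two days before "now" is flagged; a one-hour-old one is not.
print(entry_is_stale(0, now=2 * ONE_DAY))  # True
print(entry_is_stale(0, now=3_600))        # False
```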

roidelapluie avatar Sep 26 '19 15:09 roidelapluie

I will now watch the outcome over the coming days. So far, all the impacted notifications involve multiple recipients.

roidelapluie avatar Sep 26 '19 15:09 roidelapluie

@roidelapluie is this still worth pursuing?

TheMeier avatar Nov 14 '25 16:11 TheMeier