
non business hours alert resolved status not acknowledged by alertmanager and not sent to configured receiver

Open alexandrumarian-portal opened this issue 2 years ago • 4 comments

Hello everyone,

Please help me understand whether I have misconfigured Prometheus Alertmanager in any way.

The scenario is the following: if the alert is triggered during business hours, the notification is sent. If the alert is triggered outside business hours, the notification is not sent.

If the alert is resolved during non business hours (in Prometheus), the event is not acknowledged by alertmanager and therefore the resolved status is not being sent towards the configured receiver (PagerDuty, in this case).

Please help me understand where the issue is coming from.

What did you do? Configured Alertmanager to send alerts only during the business-hours interval defined in alertmanager.yml:

time_intervals:
  - name: only_in_business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
        - start_time: "07:00"
          end_time: "16:00"
  - name: weekend
    time_intervals:
      - weekdays: ['saturday','sunday']

Below is the alert rule restricted to business hours:

- name: ssl_certificate_expiry
  rules:
  - alert: cert_expiring_date
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
    for: 10m
    labels:
      severity: warning
      only_in_business_hours: true
    annotations:
      summary:  The SSL certificate will expire on {{ $labels.instance }}
      description: "SSL certificate on target will expire in less than 1 week."

What did you expect to see?

If an alert is triggered outside business hours, the notification is not sent and waits until business hours begin. If the alert is resolved outside business hours, the resolved notification should still be sent to the configured receiver.

What did you see instead? Under which circumstances?

If the alert is resolved during non business hours (in Prometheus), the event is not acknowledged by alertmanager and therefore the resolved status is not being sent towards the configured receiver (PagerDuty, in this case).

Environment

  • System information:
Linux 3.10.0-1160.31.1.el7.x86_64 x86_64
  • Alertmanager version:
alertmanager, version 0.26.0 (branch: HEAD, revision: d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d)
  • Prometheus version:
prometheus, version 2.40.3 (branch: HEAD, revision: 84e95d8cbc51b89f1a69b25dd239cae2a44cb6c1)
  • Alertmanager configuration file:
global:
  resolve_timeout: 3m

route:
  group_by: ['alertname', 'cluster', 'service', 'url']

  group_wait: 30s

  group_interval: 2m

  repeat_interval: 3h
  receiver: 'pagerduty_channel'

  routes:

  - matchers:
      - only_in_business_hours = true
    continue: true
    active_time_intervals:
      - only_in_business_hours

receivers:
  - name: "pagerduty_channel"
    pagerduty_configs:
    - routing_key: "aBeautifulAndColorfulKey"
      send_resolved: true 

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'cluster', 'service']

time_intervals:
  - name: only_in_business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
        - start_time: "07:00"
          end_time: "16:00"
  - name: weekend
    time_intervals:
      - weekdays: ['saturday','sunday']

  • Prometheus configuration file:
global:
  scrape_interval:     2s
  evaluation_interval: 2s
  query_log_file: /prometheus/logs/query.log

rule_files:
  - "alert.rules"

scrape_configs:
  - job_name: prometheus
    static_configs:
    - targets:
      - localhost:9090

alerting:
  alertmanagers:
  - scheme: 'http'
    static_configs:
    - targets:
      - 'localhost:9093'

  • Logs:
ts=2023-11-15T20:24:46.720Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup="{}/{only_in_business_hours=\"true\"}:{alertname=\"cert_expiring_date\", url=\"https://address.net/\"}" msg=flushing alerts=[cert_expiring_date[0308b61][resolved]]
ts=2023-11-15T20:24:46.720Z caller=notify.go:877 level=debug component=dispatcher msg="Notifications not sent, route is not within active time"

alexandrumarian-portal, Nov 23 '23 12:11

Alertmanager is working as intended. If the current time falls outside a route's active_time_intervals, that route is not active: it sends neither a "firing" notification nor a "resolved" notification.

dswarbrick, Nov 23 '23 13:11
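
To make the behaviour described in the reply above concrete, here is the child route from the reported configuration again, with comments marking where the muting happens. This is only an annotated sketch of the reporter's own config, not a fix. The comment about UTC assumes the intervals carry no `location` field, as in the posted file.

```yaml
routes:
  - matchers:
      - only_in_business_hours = true
    continue: true
    # While the current time is outside every interval listed below, this route
    # is inactive and sends nothing for matching alerts: no "firing"
    # notification and no "resolved" notification either.
    active_time_intervals:
      - only_in_business_hours

time_intervals:
  - name: only_in_business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
        # Interpreted in UTC here, since no `location` is set (assumption based
        # on the posted config). The resolved flush in the logs happened at
        # 20:24 UTC, which is outside this window.
        - start_time: "07:00"
          end_time: "16:00"
```

This matches the debug log line "Notifications not sent, route is not within active time", which was emitted for an alert already in the resolved state.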

And if I want to send a notification to the configured receiver when the alert is resolved outside of active_time_intervals, how can I achieve that? Thank you.

alexandrumarian-portal, Nov 23 '23 16:11

Hi! 👋 I do not believe it's possible to tell Alertmanager to send resolved notifications for alerts that are silenced, muted or outside active time intervals. Someone else might be able to correct me if this is wrong.

grobinson-grafana, Nov 23 '23 19:11

Generally this type of thing is better configured in your notification provider, e.g. PagerDuty, OpsGenie etc, since that's where you configure your teams, on-call schedules, escalation rules etc. Just let Alertmanager blast everything through to PagerDuty (regardless of time / day), and configure your custom notification behaviour there.

dswarbrick, Nov 23 '23 23:11
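
The last suggestion could look roughly like the following: the reporter's Alertmanager routing with the time restriction removed, so that every firing and resolved notification reaches PagerDuty. This is a minimal sketch built from the configuration posted in this issue. It assumes the business-hours behaviour is then set up on the PagerDuty side, which is not shown here and depends on that service's own settings.

```yaml
route:
  group_by: ['alertname', 'cluster', 'service', 'url']
  group_wait: 30s
  group_interval: 2m
  repeat_interval: 3h
  receiver: 'pagerduty_channel'
  # No child route with active_time_intervals: Alertmanager no longer mutes
  # anything by time of day, so resolved notifications are always forwarded.

receivers:
  - name: "pagerduty_channel"
    pagerduty_configs:
    - routing_key: "aBeautifulAndColorfulKey"
      # Keep this enabled so PagerDuty is told when the alert clears.
      send_resolved: true
```

With this layout the `time_intervals` section and the `only_in_business_hours` label are no longer used for routing. The label can stay on the alert rule if it is useful for filtering or scheduling on the receiving side.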