Alertmanager routing tree doesn't respect the active time interval (?)

Open bollmann opened this issue 2 years ago • 11 comments

Hi devs,

I'm trying to use the recently added time_intervals and active_time_intervals features to enable an alert to go to our pager only during working hours. I'm defining my time interval and corresponding route as follows:

    time_intervals:
      - name: workinghours
        time_intervals:
          - times:
              - start_time: 08:00
                end_time: 18:00
            weekdays:
              - monday
              - tuesday
              - wednesday
              - thursday
              - friday
...
    route:
      routes:
        - receiver: oncall-pager
          matchers:
            - severity="workinghours"
          active_time_intervals:
            - workinghours

Furthermore, I have a prometheus rule with label severity="workinghours" defined. With this configuration, I would expect my prometheus rule to only be active during working hours, i.e., Monday to Friday from 8am until 6pm. However, for some reason my prometheus rule also fires and moreover gets routed to our pager outside of the hours 8am until 6pm. That is, I get paged even when I shouldn't get paged according to the above-defined time interval workinghours.

Did I do something wrong here in the configuration? Or might this be a problem with the K8s prometheus-operator (or the kube-prometheus-stack helm chart) through which I'm using this new alertmanager feature?

At the moment, I'm using the following versions:

- alertmanager: v0.24.0
- prometheus-operator: v0.62.0
- kube-prometheus-stack helm chart: 44.4.1

Originally posted by @bollmann in https://github.com/prometheus/alertmanager/issues/2779#issuecomment-1427501988

bollmann avatar Feb 13 '23 08:02 bollmann

Have you checked the final configuration generated by the operator? I've tried a similar example on my local machine and it works. I'd enable --log.level=debug to see if anything pops up from the logs.
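
For a kube-prometheus-stack deployment, the debug log level can probably be set through the chart values rather than by passing the flag by hand. A minimal sketch, assuming the chart exposes the Alertmanager custom resource's `logLevel` field under `alertmanager.alertmanagerSpec` (verify against the values file of your chart version):

```yaml
# values.yaml fragment for kube-prometheus-stack (assumed layout; check your chart version)
alertmanager:
  alertmanagerSpec:
    # Intended to have the same effect as --log.level=debug on the Alertmanager container
    logLevel: debug
```

The Status page of the Alertmanager UI shows the configuration that was actually loaded, which is a quick way to compare the operator's output with what was written in the chart values.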

simonpasquier avatar Feb 24 '23 13:02 simonpasquier

You've defined the times for the hours you expect, but I believe the config uses UTC by default. I would make sure the times you have are correct in UTC. The v0.25.0 release added support for time zones, which has made my life easier.
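
For illustration, the interval from the original post could carry an explicit time zone on Alertmanager v0.25.0 or later. This is only a sketch: `Europe/Berlin` is a placeholder zone that does not come from the thread, and on v0.24.0 the times would have to be written in UTC instead, since the comment above dates time zone support to v0.25.0.

```yaml
time_intervals:
  - name: workinghours
    time_intervals:
      - times:
          - start_time: "08:00"
            end_time: "18:00"
        weekdays: ["monday:friday"]
        # Placeholder zone (assumption); the location field needs Alertmanager v0.25.0+
        location: "Europe/Berlin"
```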

cbryant42 avatar Feb 27 '23 12:02 cbryant42

I'm experiencing a similar issue using the following config:

global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by:
  - alertname

  routes:

  - receiver: DevOutOfHours
    matchers:
    - namespace=~".+dev|.+uat"
    active_time_intervals:
    - outofhours
    - weekends
    mute_time_intervals:
    - officehours

  - receiver: TeamA
    matchers:
    - label_team="TeamA"

  - receiver: TeamB
    matchers:
    - label_team="TeamB"

receivers:
- name: DevOutOfHours
- name: default
- name: TeamA
- name: TeamB

time_intervals:
- name: officehours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "08:00"
      end_time: "20:00"
    location: Europe/London
- name: outofhours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "00:00"
      end_time: "08:00"
    - start_time: "20:00"
      end_time: "23:59"
    location: Europe/London

- name: weekends
  time_intervals:
  - weekdays: [saturday, sunday]
    location: Europe/London

I've tried copying and pasting this into https://prometheus.io/webtools/alerting/routing-tree-editor/ and using { namespace="app-dev", label_team="TeamA"} as the test label set, but my alerts keep routing to DevOutOfHours instead of TeamA during office hours.

I only want notifications for the dev and uat namespace alerts during working hours, not out of hours or at weekends.

dtwilliamsWork avatar Jan 08 '24 14:01 dtwilliamsWork

It looks to me like the matching is working fine. Matching is done top-down, so the DevOutOfHours route is matching first. Perhaps give the configuration doc a re-read to make sure you are familiar with the routing and configuration options. It seems like you may be confused about what active_time_intervals and mute_time_intervals do (I believe you have them inverted).

To me, it seems like you would want to set up sub-routes for the teams routes. My general structure for each team's routing looks like this: I have the default route that matches only team name, then I have sub-routes that match more specific alerts, etc. In your case, this would look like:

- receiver: TeamA
  matchers:
  - label_team="TeamA"
  routes:
  # Sub-route: only reached by alerts that already matched label_team="TeamA"
  - receiver: TeamA
    matchers:
    - namespace=~".+dev|.+uat"
    mute_time_intervals:
    - outofhours
    - weekends
    active_time_intervals:
    - officehours

This matches any label_team="TeamA", then if namespace=~".+dev|.+uat" alertmanager will match the sub-route and fire under those times specified. And under this model, you can completely remove the DevOutOfHours route.

Let me know if that all makes sense, and solves your problem!

cbryant42 avatar Jan 08 '24 15:01 cbryant42

thanks for the reply, I'll give it a go.

I was using the example found here https://prometheus.io/docs/alerting/latest/configuration/#example

    # All alerts with the service=inhouse-service label match this sub-route
    # the route will be active only during offhours and holidays time intervals.
    - receiver: 'on-call-pager'
      matchers:
        - service="inhouse-service"
      active_time_intervals:
        - offhours
        - holidays

I would expect my DevOutOfHours route only to be active out of hours and move on to the next route during office hours. Maybe I've understood it wrong.

dtwilliamsWork avatar Jan 09 '24 09:01 dtwilliamsWork

Just saw this bit in the docs: "Additionally, the root node cannot have any active times."

let me try it with different subroutes

dtwilliamsWork avatar Jan 09 '24 10:01 dtwilliamsWork

Not having much luck. Shouldn't the config below just route to the default receiver? It seems like the time_intervals aren't having any impact. Running it with { namespace="app-dev", label_team="TeamA"} routes to TeamA, when I assume it should only do so at weekends. I've tried adding an additional route within the receiver, but it didn't like it.


global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by:
  - alertname

  routes:

  - receiver: TeamA
    matchers:
    - label_team="TeamA"
    active_time_intervals:
      - weekends

  - receiver: TeamB
    matchers:
    - label_team="TeamB"
    active_time_intervals:
      - weekends

receivers:
- name: default
- name: TeamA
- name: TeamB

time_intervals:
- name: officehours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "08:00"
      end_time: "20:00"
    location: Europe/London
- name: outofhours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "00:00"
      end_time: "08:00"
    - start_time: "20:00"
      end_time: "23:59"
    location: Europe/London

- name: weekends
  time_intervals:
  - weekdays: [saturday, sunday]
    location: Europe/London

dtwilliamsWork avatar Jan 09 '24 10:01 dtwilliamsWork

this doesn't work either. This should route to TeamC on weekends and TeamD in office hours, but it always routes to TeamC. Are my time_intervals set correctly??


global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by:
  - alertname

  routes:

  - receiver: TeamA
    matchers:
    - label_team="TeamA"
    routes:
      - receiver: TeamC
        matchers:
        - namespace=~".+dev|.+uat"
        active_time_intervals:
          - weekends
      - receiver: TeamD
        matchers:
        - namespace=~".+dev|.+uat"
        active_time_intervals:
          - officehours

  - receiver: TeamB
    matchers:
    - label_team="TeamB"
    routes:
      - receiver: TeamD
        matchers:
        - namespace=~".+dev|.+uat"
        active_time_intervals:
          - weekends

receivers:
- name: default
- name: TeamA
- name: TeamB
- name: TeamC
- name: TeamD

time_intervals:
- name: officehours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "08:00"
      end_time: "20:00"
    location: Europe/London
- name: outofhours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "00:00"
      end_time: "08:00"
    - start_time: "20:00"
      end_time: "23:59"
    location: Europe/London

- name: weekends
  time_intervals:
  - weekdays: [saturday, sunday]
    location: Europe/London

dtwilliamsWork avatar Jan 09 '24 12:01 dtwilliamsWork

It's possible there is a formatting issue with the time_intervals. Yours do look different than my own. I believe the routes look correct.

I would check using amtool and the online routing tree editor to confirm. I find this tool extremely useful: https://www.prometheus.io/webtools/alerting/routing-tree-editor/

Here's an example of one of my own time intervals:

time_intervals:
- name: interval_1
  time_intervals:
  - times:
    - start_time: '02:45'
      end_time: '23:45'
    weekdays: ['monday:friday']
    location: 'America/Chicago'

cbryant42 avatar Jan 09 '24 12:01 cbryant42

I think the subtle difference between the config of @cbryant42 and @dtwilliamsWork is a '-'.

Take for example (A):

time_intervals:
- name: officehours
  time_intervals:
  - weekdays: ['monday:friday']
  - times:
    - start_time: "08:00"
      end_time: "20:00"
    location: Europe/London

And (B):

time_intervals:
- name: officehours
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: "08:00"
      end_time: "20:00"
    location: Europe/London

In (A) there are two time interval definitions.

  1. The weekdays Monday through Friday.
  2. The time ranges from 8 o'clock till 20 o'clock.

In (B) there is one time interval defined, the weekdays Monday through Friday and the time range between 8 o'clock and 20 o'clock.

I think Alertmanager combines multiple time interval entries with a logical OR rather than an AND.

I had the same problem and noticed this subtle difference. I still need to verify that my hunch is correct though ;)
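
That reading agrees with the configuration documentation, which says an instant must match all fields of a single entry, while a named interval holds a list of such entries. Under that assumption, the three intervals from the earlier config would each be written as one entry, roughly as in this untested sketch (names and zone are taken from the config above; only the `24:00` end time is new):

```yaml
time_intervals:
- name: officehours
  time_intervals:
  # One entry: the weekday AND the time range must both match
  - weekdays: ['monday:friday']
    times:
    - start_time: "08:00"
      end_time: "20:00"
    location: Europe/London
- name: outofhours
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: "00:00"
      end_time: "08:00"
    - start_time: "20:00"
      # end_time is exclusive; the docs allow 24:00 to cover the rest of the day
      end_time: "24:00"
    location: Europe/London
- name: weekends
  time_intervals:
  - weekdays: [saturday, sunday]
    location: Europe/London
```

With the extra '-' removed, `officehours` no longer has a weekday-only entry that matches all day Monday to Friday.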

lordievader avatar Jan 19 '24 10:01 lordievader

I struggled with this as well. It was unclear to me at first, but the doc states something important for active_time_intervals (and similarly for mute_time_intervals):

The route will send notifications only when active, but otherwise acts normally (including ending the route-matching process if the `continue` option is not set).

I was thinking that active_time_intervals would make Alertmanager ignore the route entirely outside the interval, but it only mutes the notifications. The route still matches at all times, and its alerts will show up in the Alertmanager UI when they trigger. Only the notifications are paused outside the time interval (or during the interval with mute_time_intervals).

Hope that clarifies something for some people like me.
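
Given those semantics, one way to send the same alerts to different receivers depending on the time is to let them match several sibling routes with `continue: true`, so that each route notifies only in its own interval. A hedged sketch reusing the receiver and interval names from the configs earlier in the thread (not verified against a running Alertmanager):

```yaml
route:
  receiver: default
  routes:
  - receiver: TeamA
    matchers:
    - label_team="TeamA"
    routes:
    # Notifies only during office hours; continue lets matching move on to the next sibling
    - receiver: TeamD
      matchers:
      - namespace=~".+dev|.+uat"
      active_time_intervals:
      - officehours
      continue: true
    # Same matchers; notifies only at weekends
    - receiver: TeamC
      matchers:
      - namespace=~".+dev|.+uat"
      active_time_intervals:
      - weekends
```

A tool that only evaluates label matching would be expected to list both TeamD and TeamC for such an alert, since the time intervals affect whether a notification is sent and not which routes match.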

benjy44 avatar Mar 27 '24 08:03 benjy44