Issue with time_intervals

Open ThomasWattez opened this issue 1 year ago • 11 comments

I think my AlertManager has an issue, or I have misunderstood how time intervals are supposed to work.

What did you do?

I struggled a lot to get my AlertManager to send alerts (mail and webhook) only during working hours, because I don't want my colleagues or myself to be spammed at night, on weekends, or on public holidays.

I defined several time intervals so that AlertManager would send notifications only between 08:30 and 18:30, Monday to Friday. However, it never worked properly. Each time I tried again after changing my configuration, one of two things happened:

  • either AlertManager always sent notifications, ignoring my time intervals entirely,
  • or it never sent any, always reporting that the notification was not sent because the route was within mute time, when it clearly should not have been.

I thought for a long time about what could cause this behavior. Eventually I wondered: "Maybe the fields are mutually exclusive, meaning that if I define weekdays and times in the same time interval, it cannot work because it checks either the time or the weekday, but not both at once." That seemed silly to me, but it would explain why I was struggling so much after trying so many different things.

I checked the documentation again, and it contradicts my idea. It says: "All fields are lists. Within each non-empty list, at least one element must be satisfied to match the field. If a field is left unspecified, any value will match the field. For an instant of time to match a complete time interval, all fields must match." So I am supposed to define several fields in one time interval, because it checks that every field matches, right? Still, I tested my idea because it fit the issue I was facing so well.

I tried these two configurations on a Friday at 15:00:

First one:

  - name: 'offhours'
    time_intervals:
      - times :
        - start_time: 00:00
          end_time: 08:30
      - location: 'Europe/Paris'

Then, the second one:

  - name: 'offhours'
    time_intervals:
      - times :
        - start_time: 00:00
          end_time: 08:30
      - weekdays: ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
      - location: 'Europe/Paris'

Those time intervals are used on all routes except the root one, as expected. Here is an example of a route:

...........


  routes:

    ############## SPECIAL SSL ROUTES (NO REPEAT) ##############

    - receiver: 'x'
      matchers:
        - severity=~critical|error|warning
        - type=ssl
      # overridden so the alert is received only once
      mute_time_intervals:
        - offhours
      continue: true

    - receiver: 'xx'
      matchers:
        - type=ssl
      # the alert will not be received by other receivers
      mute_time_intervals:
        - offhours
      continue: false

    #################### DEFAULT INFRA ROUTES ########################

    - receiver: 'x'
      matchers:
        - destinataire=infra
        - severity=~critical|error|warning
      mute_time_intervals:
        - offhours
      # defines whether the alert is passed on to subsequent receivers
      continue: true


..........

What did you expect to see?

I expected alerts to be sent in both cases (on Friday at 15:00).

What did you see instead? Under which circumstances?

Alerts were sent only in the first case. AlertManager thinks it is within mute time in the second case but not in the first.

With the second definition of 'offhours', the logs show:

May 26 15:21:38 xxxxxxxx alertmanager[15532]: ts=2023-05-26T13:21:38.577Z caller=notify.go:826 level=debug component=dispatcher msg="Notifications not sent, route is within mute time"

I don't think I'm doing anything wrong. Please correct me if I am, but it looks like a bug.

Environment

  • System information:

Linux 5.10.0-0.deb10.16-cloud-amd64 x86_64

  • Alertmanager version:

    alertmanager, version 0.25.0 (branch: HEAD, revision: 258fab7cdd551f2cf251ed0348f0ad7289aee789) build user: root@abe866dd5717 build date: 20221222-14:51:36 go version: go1.19.4 platform: linux/amd64

  • Prometheus version:

prometheus, version 2.43.0+stringlabels (branch: HEAD, revision: d8ec24a7c76ab90a2332893fe12fa34e2d6e5570) build user: root@d7c52397d929 build date: 20230321-12:13:15 go version: go1.19.7 platform: linux/amd64 tags: netgo,builtinassets,stringlabels

ThomasWattez avatar May 26 '23 13:05 ThomasWattez

I should add that when I ask my system what day it is, it gives the correct answer:

# date +%A
Friday

ThomasWattez avatar May 26 '23 13:05 ThomasWattez

Your time interval syntax is incorrect. location is not a list item, it is a key. So, your first example should instead be:

time_intervals:
- name: offhours
  time_intervals:
  - times:
    - start_time: "00:00"
      end_time: "08:30"
    location: Europe/Paris

You should have seen an error logged when trying to parse your config, since it would have failed syntax validation (which is one of the reasons why the issue template requests that you include logs).

dswarbrick avatar May 29 '23 02:05 dswarbrick

Hello, thanks for your response!

I ran amtool every time I tried a new config, and it succeeded every time; I never got a syntax error because of this. However, I will try your suggestion and update my post.

ThomasWattez avatar May 30 '23 08:05 ThomasWattez

It does not work. AlertManager still says the route is within mute time when it is not. The behavior stays the same.

ThomasWattez avatar May 30 '23 09:05 ThomasWattez

Can you verify that the config has been ingested as expected with amtool config show?

dswarbrick avatar May 30 '23 10:05 dswarbrick

@ThomasWattez If I apply the time interval definition that you originally described, it loads without error, but it's obvious what the problem is. The relevant output from amtool config show:

time_intervals:
- name: offhours
  time_intervals:
  - times:                     # <- first criterion
    - start_time: "00:00"
      end_time: "08:30"
  - location: Europe/Paris     # <- second criterion

Bear in mind that the time_intervals in Alertmanager are a bit confusing. There is the named time_interval (in this case "offhours"), and then within that there can be one or more time_intervals, which can be time ranges, weekdays, days of the month, months, or years - all with an optional timezone.
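
To illustrate that layout, here is a sketch of a single named interval whose one entry sets every field type mentioned above. The name and all values are hypothetical. Because the fields sit in the same list item, an instant has to satisfy all of them to match.

```yaml
time_intervals:
- name: example                  # hypothetical name, the "named" time interval
  time_intervals:
  - times:                       # one list item = one criterion
    - start_time: "09:00"
      end_time: "17:00"
    weekdays: ['monday:friday']  # range syntax, same as listing each day
    days_of_month: ['1:7']
    months: ['january:march']
    years: ['2023:2024']
    location: Europe/Paris       # optional timezone for this criterion
```

Starting a new `- ` item under the inner `time_intervals` would add a separate criterion instead of narrowing this one.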

Your config has resulted in two logical-or criteria within the "offhours" time_interval:

  • restrict the time range to 00:00 - 08:30 UTC (no timezone specified) or
  • open-ended, no restrictions, but with Europe/Paris as the timezone (this would effectively match any time, on any weekday, on any day of the month etc)

Compare that to the subtly different:

time_intervals:
- name: offhours
  time_intervals:
  - times:
    - start_time: "00:00"
      end_time: "08:30"
    location: Europe/Paris

This now specifies one criterion for the "offhours" time_interval:

  • restrict the time range to 00:00 - 08:30 and in the Europe/Paris timezone.

dswarbrick avatar May 30 '23 11:05 dswarbrick

By extension, your second example will specify three logical-or criteria, and also definitely won't do what you expect:

- name: offhours
  time_intervals:
  - times:                                                    # <- first criterion
    - start_time: "00:00"
      end_time: "08:30"
  - weekdays: [monday, tuesday, wednesday, thursday, friday]  # <- second criterion
  - location: Europe/Paris                                    # <- third criterion

What you more likely want is:

- name: offhours
  time_intervals:
  - times:
    - start_time: "00:00"
      end_time: "08:30"
    weekdays: [monday, tuesday, wednesday, thursday, friday]
    location: Europe/Paris
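
For the goal stated at the top of the issue (notifications only between 08:30 and 18:30, Monday to Friday), the same rules suggest a mute interval with two entries. This is an untested sketch, assuming that entries are OR-ed while fields inside one entry are AND-ed as described above; public holidays are not covered and would need entries of their own.

```yaml
time_intervals:
- name: offhours
  time_intervals:
  # Entry 1: weekday early mornings and evenings (times AND weekdays)
  - times:
    - start_time: "00:00"
      end_time: "08:30"
    - start_time: "18:30"
      end_time: "24:00"
    weekdays: ['monday:friday']
    location: Europe/Paris
  # Entry 2: the whole weekend (OR-ed with entry 1)
  - weekdays: [saturday, sunday]
    location: Europe/Paris
```

Routes that list `offhours` under `mute_time_intervals` would then stay silent whenever either entry matches.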

dswarbrick avatar May 30 '23 11:05 dswarbrick

Thanks a lot for your time and your knowledge man. It was really clear and helpful.

It seems to work now! :)

Indeed, the documentation and the time_intervals definition are really confusing. I may leave this open so the Prometheus team can see that misunderstandings about this part will occur, because I don't think I'm the only one who has struggled with it.

It's a wonderful tool, but information is sometimes hard to find!

Thanks again, I wish you a good day. :)

ThomasWattez avatar May 30 '23 14:05 ThomasWattez

You're most welcome. I'm glad it's now working as you intended.

dswarbrick avatar May 30 '23 14:05 dswarbrick

I have a similar problem with time intervals. I tried using https://prometheus.io/webtools/alerting/routing-tree-editor/ to parse the config below with { namespace="app-dev", label_team="TeamA"} as the label set, but it always routes to TeamC, when I'd expect it to route to TeamE during office hours and to TeamC on weekends.


global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by:
  - alertname

  routes:

  - receiver: TeamA
    matchers:
    - label_team="TeamA"
    routes:
      - receiver: TeamC
        matchers:
        - namespace=~".+dev|.+uat"
        active_time_intervals:
        - weekends
      - receiver: TeamE
        matchers:
        - namespace=~".+dev|.+uat"
        active_time_intervals:
        - officehours

  - receiver: TeamB
    matchers:
    - label_team="TeamB"
    routes:
      - receiver: TeamD
        matchers:
        - namespace=~".+dev|.+uat"
        active_time_intervals:
        - weekends

receivers:
- name: default
- name: TeamA
- name: TeamB
- name: TeamC
- name: TeamD
- name: TeamE

time_intervals:

- name: weekends
  time_intervals:
  - weekdays: [saturday, sunday]
    location: Europe/London

- name: officehours
  time_intervals:
  - times:
    - start_time: "08:00"
      end_time: "20:00"
    weekdays: [monday, tuesday, wednesday, thursday, friday]
    location: Europe/London

dtwilliamsWork avatar Jan 09 '24 12:01 dtwilliamsWork

I don't believe the routing tree editor has any concept of time intervals, since there is no way to supply the time at which you want it to simulate the routing decision.
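
Separately from the editor's limitation, the Alertmanager configuration documentation notes that a route outside its active times sends no notifications but otherwise acts normally, including ending the route-matching process if `continue` is not set. Under that reading, the TeamC route in the config above would match every dev/uat alert first, and the TeamE sibling would never be evaluated, whatever the time. A minimal, untested sketch of one possible adjustment follows; it reuses the receiver and interval names from that config, and only the `continue: true` line is new.

```yaml
  - receiver: TeamA
    matchers:
    - label_team="TeamA"
    routes:
      - receiver: TeamC
        matchers:
        - namespace=~".+dev|.+uat"
        active_time_intervals:
        - weekends
        # Assumption: lets matching fall through to the TeamE sibling,
        # since time intervals do not affect which routes match.
        continue: true
      - receiver: TeamE
        matchers:
        - namespace=~".+dev|.+uat"
        active_time_intervals:
        - officehours
```

With this change both child routes match the alert, and each one notifies only during its own active interval. Times outside both intervals, such as weekday nights, would still produce no notification.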

dswarbrick avatar Jan 09 '24 20:01 dswarbrick