Alert grouping test with amtool

Open freeseacher opened this issue 2 years ago • 2 comments

What did you do? We actively use amtool config routes test and find it extremely useful, but we recently realized that we should also check that alert grouping is what we expect. For example, this is what we check today:

% amtool config routes test --config.file alertmanager.yaml --tree \
--verify.receivers wire-team-opsgenie 'team=wire'
Matching routes:
.
└── default-route
    └── {team=~"^(?:^(wire)$)$"}  receiver: wire-team-opsgenie
wire-team-opsgenie

It would be useful if we could pass something like:

% amtool config routes test --config.file alertmanager.yaml \
--tree --verify.receivers wire-team-opsgenie \
--verify.grouping=env,cluster,priority 'team=wire'

Matching routes:
.
└── default-route
    └── {team=~"^(?:^(wire)$)$"}  receiver: wire-team-opsgenie
wire-team-opsgenie, grouping: [env,cluster,priority]

freeseacher avatar Jul 13 '22 14:07 freeseacher

I'm not sure I follow the usefulness of this - on your example where you include the grouping, what changed?

gotjosh avatar Jul 14 '22 09:07 gotjosh

The main reason is routing with custom subroutes. For example, I have something like this:

- receiver: wire-team-opsgenie
  group_by:
    - env
    - cluster
    - priority
  match_re:
    team: ^(wire)$
  routes:
    - receiver: wire-team-opsgenie
      group_by:
        - alertname
        - cve
        - cluster
      match:
        alert_topic: security
    - receiver: wire-team-opsgenie
      group_by:
        - alertname
        - service
        - project
        - team
      match:
        alertname: QuotaCanBeReached

You can see that every alert is sent to the same receiver, but with different grouping. Downstream of Opsgenie we create a Jira issue, and the alert grouping is the key that tells us whether we have already had the same incident. So instead of opening a new Jira issue, we can append to the existing one. That is why it's crucial to check that grouping is correct when changing Alertmanager configs.
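To make the per-subroute grouping concrete, here is a minimal Python sketch of how the effective group_by could be resolved for a given label set. It is not amtool or Alertmanager code: the ROUTE dict simply mirrors the YAML above, the helper names matches and resolve are made up for illustration, and it assumes that match_re patterns are fully anchored and that the first matching child route wins (no continue).

```python
import re

# Route tree mirroring the YAML above, written as plain dicts (illustrative only).
ROUTE = {
    "receiver": "wire-team-opsgenie",
    "group_by": ["env", "cluster", "priority"],
    "match_re": {"team": "^(wire)$"},
    "routes": [
        {
            "receiver": "wire-team-opsgenie",
            "group_by": ["alertname", "cve", "cluster"],
            "match": {"alert_topic": "security"},
        },
        {
            "receiver": "wire-team-opsgenie",
            "group_by": ["alertname", "service", "project", "team"],
            "match": {"alertname": "QuotaCanBeReached"},
        },
    ],
}


def matches(route, labels):
    """Return True if every matcher on the route accepts the label set."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        # Assumption: regex matchers must match the whole label value.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def resolve(route, labels, inherited_group_by=None):
    """Return (receiver, group_by) of the deepest matching route, or None."""
    if not matches(route, labels):
        return None
    # A route without its own group_by inherits the parent's.
    group_by = route.get("group_by", inherited_group_by)
    for child in route.get("routes", []):
        found = resolve(child, labels, group_by)
        if found is not None:
            return found  # assumption: first matching child wins, no `continue`
    return route.get("receiver"), group_by


if __name__ == "__main__":
    # Same receiver in both cases, different grouping.
    print(resolve(ROUTE, {"team": "wire"}))
    print(resolve(ROUTE, {"team": "wire", "alert_topic": "security"}))
```

Both calls return the same receiver, but the first yields the grouping env, cluster, priority and the second alertname, cve, cluster, which is exactly the difference a receiver-only check cannot see.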

I propose two things:

  1. Show the receiver's grouping when displaying the routing tree, perhaps here: https://github.com/prometheus/alertmanager/blob/main/cli/routing.go#L89
{team=~"^(?:^(wire)$)$"}  receiver: wire-team-opsgenie
wire-team-opsgenie, *grouping: [env,cluster,priority]*
  2. Add a new option, verify.grouping, that checks whether the receiver got the expected grouping. Maybe something like --verify.grouping[0]=[alertname,cve,cluster] would do the trick.
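As a rough illustration of the second proposal, the check itself could be as small as the Python sketch below. The verify.grouping option does not exist in amtool; the function names parse_grouping and verify_grouping are hypothetical, and the sketch assumes that the order of labels in group_by does not matter, so it compares sets.

```python
def parse_grouping(value):
    """Turn 'a,b,c' or '[a,b,c]' into a set of label names."""
    inner = value.strip().strip("[]")
    return {name.strip() for name in inner.split(",") if name.strip()}


def verify_grouping(expected_value, actual_group_by):
    """Return True if the route's group_by equals the expected label set."""
    # Assumption: grouping is order-insensitive, so compare as sets.
    return parse_grouping(expected_value) == set(actual_group_by)


if __name__ == "__main__":
    # Value as it might be passed on the command line in the proposal.
    print(verify_grouping("[alertname,cve,cluster]", ["alertname", "cve", "cluster"]))
    print(verify_grouping("env,cluster", ["env", "cluster", "priority"]))
```

A real implementation would presumably report which labels are missing or unexpected instead of returning a bare boolean.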

freeseacher avatar Jul 14 '22 10:07 freeseacher