
"target-only" inhibition rules only work if there're other alerts firing

riyad opened this issue 4 years ago • 2 comments

What did you do?

Assume I have a setup with "trigger alerts" that exist only to inhibit other alerts, or where I want to inhibit whole categories of (noisy) alerts. I have found "target-only" inhibition rules (i.e. rules with only target_matchers but no source_matchers or equal) to be a great "loophole" for simple "self-inhibiting" rules:

inhibit_rules:
[...]
- target_matchers: [ severity="notify" ]  # i.e. trigger alerts
- target_matchers:  # i.e. noisy alerts on laptops
    - device_role="laptop"
    - severity="warning"
    - alertname=~"HostSwapIsFillingUp|HostOutOfMemory|PrometheusTargetMissing"
[...]

What did you expect to see?

These should inhibit the matched target alerts (since there is no overlap with source_matchers, the exception for self-inhibiting rules does not apply).

What did you see instead? Under which circumstances?

These (target-only) inhibition rules worked as expected (not sure if that was intended or not :wink: ) for 6-8 weeks, but then started showing "racy" behavior. I couldn't exactly pinpoint when it started, but my observation over the last weeks was: if the only alerts firing are ones that should be inhibited by these target-only rules, they are not inhibited. Once a non-inhibited alert fires, the target-only inhibition rules are applied again. This is most obvious when there are only one or two alerts in flight.

All other inhibition rules seem to work without problems.

Environment

  • System information:

Docker version 20.10.11, build dea9396 on Ubuntu 20.04.3 LTS (GNU/Linux 5.11.0-41-generic x86_64)

  • Alertmanager version:
$ docker exec alertmanager alertmanager --version
alertmanager, version 0.23.0 (branch: HEAD, revision: 61046b17771a57cfd4c4a51be370ab930a4d7d54)
  build user:       root@e21a959be8d2
  build date:       20210825-10:48:55
  go version:       go1.16.7
  platform:         linux/amd64
  • Prometheus version:
$ docker exec prometheus prometheus --version
prometheus, version 2.31.1 (branch: HEAD, revision: 411021ada9ab41095923b8d2df9365b632fd40c3)
  build user:       root@9419c9c2d4e0
  build date:       20211105-20:35:02
  go version:       go1.17.3
  platform:         linux/amd64

riyad • Dec 07 '21 19:12

My initial thoughts:

In any case, the behavior should not be "racy". From what you write, the behavior seems undefined when there are no source matchers and/or no equal labels. If that's really the case, it, at the very least, needs to be documented. But I would consider it a bug. We should clearly document what will happen in those cases, and make sure the implementation matches that behavior.

At first glance, I'd read the current documentation as saying that an inhibition rule without source matchers should never inhibit anything. I guess an argument could be made that an empty equal list should imply that no label values need to be equal to trigger the inhibition. But that has to be thought through (and needs to be documented).

Finally, about your use case of "trigger alerts": Assuming that such a setup makes sense in the first place, I would say it's much cleaner to go via a routing rule rather than an inhibition rule. Add an explicit label, e.g. severity="trigger", and then have a route in your routing tree that sends those alerts to a blackhole receiver.
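For illustration, a minimal sketch of what I mean (only the relevant config fragment; the names are placeholders, and a receiver without any notification configs simply never sends anything):

route:
  routes:
    - matchers: [ severity="trigger" ]
      receiver: blackhole

receivers:
  - name: blackhole  # no *_configs listed, so alerts routed here go nowhere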

beorn7 • Dec 14 '21 12:12

But I would consider it a bug.

I had a feeling it would come down to this :disappointed:, but I hope I can make a case for why this edge case might be useful.

In any case, the behavior should not be "racy".

At the moment it definitely is ... although it's been working again for several days now. I really can't put my finger on why/when it works and when it doesn't. :shrug:

tl;dr

  • There's too little intermediate-level documentation and guidance on patterns and use-cases.
  • There's little to no paper trail of how an alert was processed (even in the logs). That's why I prefer inhibitions over blackhole routing: they at least leave an indication that something is being hidden.
  • IMHO there's a use case for "trigger alerts" (e.g. "batch job is running") to temporarily inhibit other alerts.

I accept that I may be an idiot and that I'm talking out of ignorance :clown_face: ... which is (hopefully) emphasizing my first point. :stuck_out_tongue_closed_eyes: Sorry for the lengthy answer below. :sweat_smile:

Where to apply logic?

I'm new to the Prometheus/Alertmanager ecosystem, and the way I see it there are basically three levels at which logic can be applied:

  1. Prometheus alert rules
  2. Alertmanager inhibition rules
  3. Alertmanager routing rules

I was left with the impression that Prometheus alerting rules should be kept simple and general, and that I could attach extra labels to Prometheus targets to filter and route alerts in Alertmanager.

So where do I put "overarching" rules, e.g. alert when RAM or swap gets exhausted, but not for developer workstations and laptops? Does it make sense to

  • extend Prometheus alerting rules with these conditions?
  • create "trigger alerts" in Prometheus with these conditions?
  • use Alertmanager inhibitions (either with "trigger alerts" or with "target-only" rules)?
  • use Alertmanager routing?

Which of the mechanisms do I use when? IIRC the documentation doesn't give any guidance here.
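Just to make the first option concrete, I imagine it would look roughly like this (a sketch only, reusing the device_role target label from my config above and node_exporter metric names; the threshold and the regex values are made up):

groups:
  - name: memory
    rules:
      - alert: HostOutOfMemory
        # exclude laptops/workstations directly in the alerting rule
        # via the device_role target label
        expr: |
          node_memory_MemAvailable_bytes{device_role!~"laptop|workstation"}
            / node_memory_MemTotal_bytes < 0.10
        for: 15m
        labels:
          severity: warning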

Just by having to explain all this, I get the feeling that there's definitely room for better intermediate-level documentation (i.e. more elaboration on best practices), e.g. a collection of links to Stack Exchange/Server Fault, blog posts, discussions, etc.

A case for inhibition over blackhole routing

Finally, about your use case of "trigger alerts": Assuming that such a setup makes sense in the first place, I would say it's much cleaner to go via a routing rule rather than an inhibition rule. Add an explicit label, e.g. severity="trigger", and then have a route in your routing tree that sends those alerts to a blackhole receiver.

If I want certain alerts to "reach nobody", do I "inhibit" those alerts, or do I route them to a receiver called "nobody" that happens to reach nobody? I hope my polemic phrasing makes it clear that, from a novice's point of view, it's neither "clean" nor obvious to go for the routing option. So, I actually do use blackhole routing as a workaround for the "racy" inhibitions, but I'm not particularly happy about it (explanation below). :neutral_face:

I totally understand the reasons for prohibiting self-inhibition of alerts, because of the potential damage it can cause and the support troubles it would bring to the Alertmanager project itself. But target-only inhibition rules provide a "clean" (using your words against you :stuck_out_tongue_closed_eyes:), unambiguous and IMHO useful edge case. :wink: That's why I'm hoping "target-only" inhibitions will become an accepted and documented edge case. :pray:

IMHO one of the problems is the missing transparency about what happens to an alert inside Alertmanager. AFAIK even with --log.level=debug there's ZERO indication of how an alert was processed:

  • which inhibition rule matched (or was explicitly skipped because of potential self-inhibition)?
  • which silence matched?
  • which routes matched?

There's definitely a paper trail missing ... which also makes debugging rules unnecessarily a matter of guesswork. IMHO that's why inhibition is my preferred choice (in contrast to "blackhole routing"): I can see that something was hidden from me if I need to (the same applies to silences).
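The closest thing I've found is a dry run with amtool, which at least answers the routing question (the config path below is the one from the official Docker image, adjust as needed; the label set is just an example). But it only tells me what would happen to a given label set, not what actually happened to a specific alert:

$ docker exec alertmanager amtool config routes test \
    --config.file=/etc/alertmanager/alertmanager.yml \
    severity=warning device_role=laptop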

riyad • Dec 17 '21 00:12