Mark the fingerprints of all inhibiting rules
Before this change, Alertmanager would mark just the fingerprint of the first inhibition rule to inhibit an alert, rather than the fingerprints of all inhibition rules that inhibit the alert. This is different from silences which mark the IDs of all silences that silence an alert, not just the first silence.
This commit changes how inhibition rules work to match the behavior of silences, such that the fingerprints of all inhibition rules that inhibit the alert are marked.
This change is expected to increase CPU usage as the inhibiter has to do more work, matching the label set against all inhibition rules instead of stopping at the first match. However, this is no different from how silences mute alerts.
@beorn7 I see the comments were added in this PR here. What were the performance issues observed that caused inhibition rules to behave different from silences?
I just realized this code is broken when the source_matchers for an inhibition rule matches two or more alerts. In this case, the inhibition rule cache returns the fingerprint of just one of the inhibiting alerts rather than all fingerprints (link).
Here are the relevant PRs https://github.com/prometheus/alertmanager/pull/379 and https://github.com/prometheus/alertmanager/pull/1774.
I really don't remember the detailed motivations here, just that it was complicated. Maybe the problems you have encountered above have to do with the reason.
OK so I created a number of benchmarks to better understand how performance would be affected with this change. The four benchmarks do the following:
x_inhibition_rule,_1_inhibiting_alert-8benchmarks the performance ofxinhibition rules, where each inhibition rule has one inhibiting alert.1_inhibition_rule,_x_inhibiting_alerts-8benchmarks the performance of 1 inhibition rule, where the inhibition rule hasxinhibiting alerts.100_inhibition_rules,_1000_inhibiting_alerts-8benchmarks the performance of 100 inhibition rules, which 1000 inhibiting alerts each.x_inhibition_rules,_last_rule_matches-8benchmarks the performance ofxinhibition rules, where just the last inhibition rule inhibits, and all other inhibition rules are no-ops.
Here are the benchmarks from main:
BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-8 1424942 810.7 ns/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-8 1384290 817.0 ns/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-8 1355450 819.5 ns/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-8 1334260 850.6 ns/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-8 1355574 842.5 ns/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-8 1227871 906.0 ns/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-8 490380 2227 ns/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-8 97848 12087 ns/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-8 10827 104526 ns/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-8 100384 11609 ns/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-8 1103008 1023 ns/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-8 345433 3303 ns/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-8 40462 26343 ns/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-8 4449 252129 ns/op
And here are the benchmarks from this branch:
BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-8 1352847 819.4 ns/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-8 283840 4123 ns/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-8 30540 35556 ns/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-8 3244 353088 ns/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-8 290 3649391 ns/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-8 316680 3682 ns/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-8 37304 28697 ns/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-8 4122 280268 ns/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-8 361 2952918 ns/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-8 34 32002771 ns/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-8 1054029 1065 ns/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-8 342032 3387 ns/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-8 38838 27282 ns/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-8 4356 262349 ns/op
I also benchmarked Mutes for silences:
BenchmarkMutes/1_silence_mutes_alert-8 1328642 867.0 ns/op
BenchmarkMutes/10_silences_mute_alert-8 558032 2151 ns/op
BenchmarkMutes/100_silences_mute_alert-8 105117 11372 ns/op
BenchmarkMutes/1000_silences_mute_alert-8 8367 135013 ns/op
BenchmarkMutes/10000_silences_mute_alert-8 745 1576070 ns/op
BenchmarkQuery/100_silences-8 65757 18499 ns/op
BenchmarkQuery/1000_silences-8 7033 164574 ns/op
BenchmarkQuery/10000_silences-8 298 3876756 ns/op
The benchmarks shows the following:
Inhibitor.MutesisO(N)relative to the number of inhibition rules. The more inhibition rules there are the more comparisons need to be made. No surprises here.- Since in
mainbranchinhibitor.Mutesstops at the first inhibiting rule, when there are lots of inhibition rules it's faster for an alert to be inhibited than not inhibited. This is not the case in this branch though, as we are matching all inhibition rules. No surprises here either. - It seems that having lots of silences doesn't scale well either. In fact silences seem to scale the same as inhibition rules do with the changes in this branch. This is surprising and so I want to double check this further as I know Bjorn did some optimizations here.
- With sensible limits, I think we can have reasonable performance for both inhibition rules and silences, where we track all inhibiting alerts and silence IDs up to some sensible limit.