alertmanager Mark the fingerprints of all inhibiting rules

Before this change, Alertmanager would mark just the fingerprint of the first inhibition rule to inhibit an alert, rather than the fingerprints of all inhibition rules that inhibit the alert. This differs from silences, where the IDs of all silences that silence an alert are marked, not just the first.

This commit changes how inhibition rules work to match the behavior of silences, such that the fingerprints of all inhibition rules that inhibit the alert are marked.

This change is expected to increase CPU usage as the inhibiter has to do more work, matching the label set against all inhibition rules instead of stopping at the first match. However, this is no different from how silences mute alerts.
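To illustrate the difference in work, here is a minimal sketch of the two strategies. It is not the Alertmanager code: the `rule` type, its `fingerprint` and `target` fields, and the `firstMatch` and `allMatches` helpers are hypothetical names chosen for this example, and label matching is reduced to plain equality.

```go
package main

import "fmt"

// rule is a hypothetical stand-in for an inhibition rule. It inhibits any
// alert whose label set contains every name/value pair in target.
type rule struct {
	fingerprint uint64
	target      map[string]string
}

// inhibits reports whether labels contains every pair in r.target.
func (r rule) inhibits(labels map[string]string) bool {
	for name, value := range r.target {
		if labels[name] != value {
			return false
		}
	}
	return true
}

// firstMatch sketches the old behavior: stop at the first inhibiting rule.
func firstMatch(rules []rule, labels map[string]string) []uint64 {
	for _, r := range rules {
		if r.inhibits(labels) {
			return []uint64{r.fingerprint}
		}
	}
	return nil
}

// allMatches sketches the new behavior: check every rule and collect the
// fingerprint of each one that inhibits the alert.
func allMatches(rules []rule, labels map[string]string) []uint64 {
	var fps []uint64
	for _, r := range rules {
		if r.inhibits(labels) {
			fps = append(fps, r.fingerprint)
		}
	}
	return fps
}

func main() {
	rules := []rule{
		{fingerprint: 1, target: map[string]string{"cluster": "a"}},
		{fingerprint: 2, target: map[string]string{"severity": "warning"}},
		{fingerprint: 3, target: map[string]string{"cluster": "b"}},
	}
	labels := map[string]string{"cluster": "a", "severity": "warning"}

	fmt.Println(firstMatch(rules, labels)) // [1]
	fmt.Println(allMatches(rules, labels)) // [1 2]
}
```

In the sketch, `firstMatch` returns after one comparison when the first rule matches, while `allMatches` always visits every rule, which is where the extra CPU cost comes from.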

Mar 19 '24 20:03 grobinson-grafana

@beorn7 I see the comments were added in this PR here. What were the performance issues observed that caused inhibition rules to behave differently from silences?

Mar 19 '24 21:03 grobinson-grafana

I just realized this code is broken when the source_matchers for an inhibition rule match two or more alerts. In this case, the inhibition rule cache returns the fingerprint of just one of the inhibiting alerts rather than all fingerprints (link).
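As a rough sketch of what returning all fingerprints could look like (this is not the actual cache code; the `alert` type, the `inhibitors` helper, and the flat slice standing in for the cache are assumptions made for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

// alert is a hypothetical, simplified source alert held in a rule's cache.
type alert struct {
	fingerprint uint64
	labels      map[string]string
}

// inhibitors returns the fingerprints of every cached source alert whose
// values for the given equal labels agree with the target's values, rather
// than returning after the first hit. The result is sorted for stable output.
func inhibitors(sources []alert, equal []string, target map[string]string) []uint64 {
	var fps []uint64
	for _, s := range sources {
		ok := true
		for _, name := range equal {
			if s.labels[name] != target[name] {
				ok = false
				break
			}
		}
		if ok {
			fps = append(fps, s.fingerprint)
		}
	}
	sort.Slice(fps, func(i, j int) bool { return fps[i] < fps[j] })
	return fps
}

func main() {
	sources := []alert{
		{fingerprint: 30, labels: map[string]string{"cluster": "a"}},
		{fingerprint: 10, labels: map[string]string{"cluster": "a"}},
		{fingerprint: 20, labels: map[string]string{"cluster": "b"}},
	}
	target := map[string]string{"cluster": "a", "severity": "warning"}

	fmt.Println(inhibitors(sources, []string{"cluster"}, target)) // [10 30]
}
```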

Mar 20 '24 07:03 grobinson-grafana

Here are the relevant PRs https://github.com/prometheus/alertmanager/pull/379 and https://github.com/prometheus/alertmanager/pull/1774.

Mar 20 '24 07:03 grobinson-grafana

I really don't remember the detailed motivations here, just that it was complicated. Maybe the problems you have encountered above have to do with the reason.

Mar 20 '24 11:03 beorn7

OK so I created a number of benchmarks to better understand how performance would be affected with this change. The four benchmarks do the following:

x_inhibition_rule,_1_inhibiting_alert-8 benchmarks the performance of x inhibition rules, where each inhibition rule has one inhibiting alert.
1_inhibition_rule,_x_inhibiting_alerts-8 benchmarks the performance of 1 inhibition rule, where the inhibition rule has x inhibiting alerts.
100_inhibition_rules,_1000_inhibiting_alerts-8 benchmarks the performance of 100 inhibition rules, with 1000 inhibiting alerts each.
x_inhibition_rules,_last_rule_matches-8 benchmarks the performance of x inhibition rules, where just the last inhibition rule inhibits, and all other inhibition rules are no-ops.
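
For readers who want to reproduce the shape of the last case without the Alertmanager test harness, here is a standalone sketch. It is not the benchmark code from this branch: the `rule` type and the `buildRules` and `mutes` helpers are hypothetical, and it uses `testing.Benchmark` from the standard library so it can run as a plain program. The timings it prints are not comparable to the numbers in this thread.

```go
package main

import (
	"fmt"
	"testing"
)

// rule is a hypothetical, simplified inhibition rule.
type rule struct {
	id     int
	target map[string]string
}

// inhibits reports whether labels contains every pair in r.target.
func (r rule) inhibits(labels map[string]string) bool {
	for name, value := range r.target {
		if labels[name] != value {
			return false
		}
	}
	return true
}

// buildRules returns n rules (n >= 1) where only the last rule matches
// an alert labeled cluster="a"; all other rules are no-ops for that alert.
func buildRules(n int) []rule {
	rules := make([]rule, n)
	for i := range rules {
		rules[i] = rule{id: i, target: map[string]string{"cluster": "none"}}
	}
	rules[n-1].target = map[string]string{"cluster": "a"}
	return rules
}

// mutes checks every rule and returns the IDs of all rules that inhibit.
func mutes(rules []rule, labels map[string]string) []int {
	var ids []int
	for _, r := range rules {
		if r.inhibits(labels) {
			ids = append(ids, r.id)
		}
	}
	return ids
}

func main() {
	labels := map[string]string{"cluster": "a"}
	for _, n := range []int{10, 1000} {
		rules := buildRules(n)
		res := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				mutes(rules, labels)
			}
		})
		fmt.Printf("%d rules, last rule matches: %s\n", n, res)
	}
}
```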

Here are the benchmarks from main:

BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-8         	 1424942	       810.7 ns/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-8       	 1384290	       817.0 ns/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-8      	 1355450	       819.5 ns/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-8     	 1334260	       850.6 ns/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-8    	 1355574	       842.5 ns/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-8       	 1227871	       906.0 ns/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-8      	  490380	      2227 ns/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-8     	   97848	     12087 ns/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-8    	   10827	    104526 ns/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-8  	  100384	     11609 ns/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-8        	 1103008	      1023 ns/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-8       	  345433	      3303 ns/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-8      	   40462	     26343 ns/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-8     	    4449	    252129 ns/op

And here are the benchmarks from this branch:

BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-8         	 1352847	       819.4 ns/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-8       	  283840	      4123 ns/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-8      	   30540	     35556 ns/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-8     	    3244	    353088 ns/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-8    	     290	   3649391 ns/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-8       	  316680	      3682 ns/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-8      	   37304	     28697 ns/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-8     	    4122	    280268 ns/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-8    	     361	   2952918 ns/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-8  	      34	  32002771 ns/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-8        	 1054029	      1065 ns/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-8       	  342032	      3387 ns/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-8      	   38838	     27282 ns/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-8     	    4356	    262349 ns/op

I also benchmarked Mutes for silences:

BenchmarkMutes/1_silence_mutes_alert-8         	 1328642	       867.0 ns/op
BenchmarkMutes/10_silences_mute_alert-8        	  558032	      2151 ns/op
BenchmarkMutes/100_silences_mute_alert-8       	  105117	     11372 ns/op
BenchmarkMutes/1000_silences_mute_alert-8      	    8367	    135013 ns/op
BenchmarkMutes/10000_silences_mute_alert-8     	     745	   1576070 ns/op
BenchmarkQuery/100_silences-8                  	   65757	     18499 ns/op
BenchmarkQuery/1000_silences-8                 	    7033	    164574 ns/op
BenchmarkQuery/10000_silences-8                	     298	   3876756 ns/op

The benchmarks show the following:

Inhibitor.Mutes is O(N) relative to the number of inhibition rules. The more inhibition rules there are the more comparisons need to be made. No surprises here.
Since Inhibitor.Mutes on the main branch stops at the first inhibiting rule, when there are lots of inhibition rules it's faster for an alert to be inhibited than not inhibited. This is not the case in this branch though, as we are matching all inhibition rules. No surprises here either.
It seems that having lots of silences doesn't scale well either. In fact silences seem to scale the same as inhibition rules do with the changes in this branch. This is surprising and so I want to double check this further as I know Bjorn did some optimizations here.
With sensible limits, I think we can have reasonable performance for both inhibition rules and silences, where we track all inhibiting alerts and silence IDs up to some sensible limit.
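
To make the idea of a limit concrete, here is a minimal sketch of collecting matches up to a cap. Nothing here exists in Alertmanager: `matchUpTo`, its parameters, and the "stopped early" flag are assumptions for illustration, and what a sensible limit would be is left open.

```go
package main

import "fmt"

// matchUpTo returns the candidates for which matches is true, stopping as
// soon as limit matches have been collected. The second result reports
// whether the scan stopped before checking every candidate. A limit of zero
// or less means no limit.
func matchUpTo(candidates []uint64, matches func(uint64) bool, limit int) ([]uint64, bool) {
	var out []uint64
	for i, c := range candidates {
		if !matches(c) {
			continue
		}
		out = append(out, c)
		if len(out) == limit {
			// Stopped early only if some candidates remain unchecked.
			return out, i < len(candidates)-1
		}
	}
	return out, false
}

func main() {
	candidates := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	isEven := func(id uint64) bool { return id%2 == 0 }

	ids, stoppedEarly := matchUpTo(candidates, isEven, 3)
	fmt.Println(ids, stoppedEarly) // [2 4 6] true
}
```

The trade-off is that the worst case is bounded by the limit rather than by the total number of rules or silences, at the cost of possibly not reporting every inhibiting alert or silence ID.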

Mar 22 '24 10:03 grobinson-grafana