alertmanager icon indicating copy to clipboard operation
alertmanager copied to clipboard

Signficantly improve inhibitor performance via new cache datastructure

Open Spaceman1701 opened this issue 1 year ago • 6 comments

This PR improves or matches performance of the existing inhibitor in almost all benchmark cases. When each inhibition rule has many inhibiting alerts, performance is improved by several orders of magnitude. No change to the inhibitor interface is necessary.

                                                     │    main.txt    │    icache-for-upstream-final.txt    │
                                                     │     sec/op     │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4            1.865µ ±  4%   1.597µ ±  7%  -14.35% (p=0.002 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4          2.007µ ±  7%   1.464µ ±  5%  -27.04% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4         1.941µ ±  7%   1.447µ ± 16%  -25.46% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4        1.884µ ±  3%   1.422µ ± 10%  -24.55% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4       1.879µ ± 13%   1.419µ ±  4%  -24.49% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4          1.973µ ±  8%   1.404µ ±  4%  -28.86% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4         3.726µ ± 10%   1.348µ ± 13%  -63.84% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4       21.485µ ±  5%   1.411µ ±  3%  -93.43% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4     184.245µ ±  9%   1.489µ ±  6%  -99.19% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4    20.453µ ± 14%   1.409µ ±  5%  -93.11% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4           2.226µ ±  8%   1.912µ ±  8%  -14.13% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4          6.959µ ±  2%   6.293µ ±  5%   -9.57% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4         55.08µ ±  4%   48.42µ ±  3%  -12.10% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4        534.1µ ±  7%   488.9µ ±  4%   -8.45% (p=0.002 n=6)
geomean                                                  8.267µ         3.181µ        -61.52%                                            8.267µ         3.364µ        -59.31%

The new implementation replaces the InhibitRule's scache with an icache (meaning "intersection cache") which pre-calculates the equal label values a target alert would need to be inhibited by each inhibitor alert. Practically, this is implemented by calculating a fingerprint for just the equalLabels of each inhibiting alert and storing that alert in a map keyed by the equalLabels fingerprint. This allows O(1) access to the set of inhibiting alerts for an InhibitRule in the Inhibitor.Mutes method. The new data structure is essentially just:

map[model.Fingerpint]map[model.Fingerprint]*types.Alert
// equaLabelsFp -> fp -> alert

To insert a new inhibitor alert "A":

  1. does "A" match the source labels? If not, exit
  2. make a LabelSet of just the subset of A's labels which are in the InhibitRule's equal_labels. Compute the fingerpint for this LabelSet and call it "E"
  3. find the entry in the icache for E and insert A into it using A's fingerprint

To check if an alert "T" is inhibited:

  1. does T match the target labels? If not, exit. T is not inhibited
  2. make a LabelSet of just the subset of T's labels which are in the InhibitRule's equal_labels. Compute the fingerpint for this LabelSet and call it "E"
  3. does an entry exist in the InhibitRule's icache for E? If not, exit. T is not inhibited
  4. Find the first non-resolved alert in the icache entry and return it. T is inhibited by this alert.

The new implementation has a few other minor improvements which help performance: it reduces the number of time an inhibitor alert's fingerprint is calculated by caching it in the icache, it pre-calculates whether an inhibitor matches the source and target matchers for an inhibit rules, it calls time.Now() less frequently (similar to #4119), and it avoids checking if a target alert matches the source matchers unless necessary.

In the real world, this change helps most when an inhibit rule covers many inhibitors and target alerts. For example, a rule where all critical alerts inhibit all warning alerts with the same instance and alertname can result in many possible inhibitors attached to a rule where each inhibitor alert only inhibits one target alert. The old implementation of the inhibitor is extremely inefficient in this case: assuming N inhibiting alerts and a fixed number of equalLabels, it performs O(N) string comparisons per target. The new implementation is effectively O(1) in this case (except in a pathological case where all alerts in the inhibitor are resolved, since the new implementation still has to iterate past resolved alerts).

The new implementation does require slightly more memory per inhibiting alert than the old one: we store the alert's fingerprint and a boolean to check if the inhibitor matches both the source and the target. This works out to be about 16 extra bytes per cached inhibitor alert after accounting for padding. For 10,000 alerts, that's about 160KB. I think this is a totally acceptable increase in memory requirements.

We've been running a very similar patch in our production environment which adds all inhibiting alerts to the marker for more than a month. In our environment this still results in a more then 4x reduction in time spent in the Inhibitor.Mutes method.

This change was motivated by observations which show a large amount of CPU spent doing string comparisons in the inhibitor. It seems like we're not the only ones who have been looking for this, and I think this might be a good alternative to #3933.

Spaceman1701 avatar Nov 22 '24 20:11 Spaceman1701

cc @grobinson-grafana or @gotjosh is there any interest in this PR? Or something I can do to make this easier to review?

Spaceman1701 avatar Dec 18 '24 17:12 Spaceman1701

I just updated the PR to include a new benchmark that's the same as 1_inhibition_rule,_X_inhibiting_alerts that uses the Equal field so that the target labelset is only inhibit by the last source alert. I think this isolates the O(1) behavior of the new implementation pretty well.

Here's a quick 1 iteration run on my machine:

$ benchstat 1-run-main.txt 1-run-patch.txt
goos: linux
goarch: amd64
pkg: github.com/prometheus/alertmanager/inhibit
cpu: AMD EPYC Processor (with IBPB)
                                                                   │ 1-run-main.txt │            1-run-patch.txt            │
                                                                   │     sec/op     │    sec/op     vs base                 │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                          1.878µ ± ∞ ¹   1.824µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/10_inhibition_rules,_1_inhibiting_alert-4                        1.893µ ± ∞ ¹   1.758µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/100_inhibition_rules,_1_inhibiting_alert-4                       1.890µ ± ∞ ¹   1.663µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4                      2.108µ ± ∞ ¹   1.754µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4                     1.815µ ± ∞ ¹   1.513µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4                        1.902µ ± ∞ ¹   1.511µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4                       3.561µ ± ∞ ¹   1.505µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4                     23.580µ ± ∞ ¹   1.608µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4                   177.102µ ± ∞ ¹   1.573µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_10_inhibiting_alerts_w/_random_match-4        2.480µ ± ∞ ¹   2.155µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_100_inhibiting_alerts_w/_random_match-4       7.440µ ± ∞ ¹   2.228µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_random_match-4     23.019µ ± ∞ ¹   1.698µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_random_match-4   176.798µ ± ∞ ¹   1.740µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4                  19.699µ ± ∞ ¹   1.455µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/10_inhibition_rules,_last_rule_matches-4                         2.834µ ± ∞ ¹   1.989µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/100_inhibition_rules,_last_rule_matches-4                       13.936µ ± ∞ ¹   7.869µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1000_inhibition_rules,_last_rule_matches-4                      123.32µ ± ∞ ¹   74.61µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/10000_inhibition_rules,_last_rule_matches-4                     1238.7µ ± ∞ ¹   602.5µ ± ∞ ¹        ~ (p=1.000 n=1) ²
geomean                                                                11.09µ         3.192µ        -71.23%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

I'll also setup a version that just switches scache to a map[model.Fingerprint]*types.Alert and see how it does in comparison.

Spaceman1701 avatar Apr 07 '25 15:04 Spaceman1701

Ah, I typo-d the benchmark function name for 1_inhibition_rule,_1000_inhibiting_alerts_w/_random_match-4 and 1_inhibition_rule,_10000_inhibiting_alerts_w/_random_match-4 which explains the strange scaling there.

I just pushed a fixed version and things look more logical:

$ benchstat 1-run-remove-store.txt 1-run-patch.txt
goos: linux
goarch: amd64
pkg: github.com/prometheus/alertmanager/inhibit
cpu: AMD EPYC Processor (with IBPB)
                                                                 │ 1-run-remove-store.txt │            1-run-patch.txt            │
                                                                 │         sec/op         │    sec/op     vs base                 │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                                1.753µ ± ∞ ¹   1.515µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/10_inhibition_rules,_1_inhibiting_alert-4                              1.677µ ± ∞ ¹   1.476µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/100_inhibition_rules,_1_inhibiting_alert-4                             1.727µ ± ∞ ¹   1.425µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4                            1.596µ ± ∞ ¹   1.425µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4                           1.671µ ± ∞ ¹   1.435µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4                              1.882µ ± ∞ ¹   1.665µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4                             1.881µ ± ∞ ¹   1.349µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4                            2.012µ ± ∞ ¹   1.404µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4                           2.146µ ± ∞ ¹   1.534µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-4                2.501µ ± ∞ ¹   2.126µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-4               6.389µ ± ∞ ¹   2.046µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-4             54.494µ ± ∞ ¹   2.575µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-4           637.438µ ± ∞ ¹   1.746µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4                         1.696µ ± ∞ ¹   1.497µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/10_inhibition_rules,_last_rule_matches-4                               2.862µ ± ∞ ¹   2.153µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/100_inhibition_rules,_last_rule_matches-4                             13.952µ ± ∞ ¹   7.199µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/1000_inhibition_rules,_last_rule_matches-4                            120.72µ ± ∞ ¹   64.23µ ± ∞ ¹        ~ (p=1.000 n=1) ²
Mutes/10000_inhibition_rules,_last_rule_matches-4                           1191.2µ ± ∞ ¹   608.0µ ± ∞ ¹        ~ (p=1.000 n=1) ²
geomean                                                                      6.855µ         3.062µ        -55.34%

The baseline version has store.Alerts replaced with a plain map, so the performance improvement is not caused by saving allocations from scache.List().

Spaceman1701 avatar Apr 07 '25 16:04 Spaceman1701

Just for completeness, here's a full 6 iteration run which compares the implementation on main with the two local changes we discussed in the CNCF slack:

  • store.Alerts replaced with a map
  • similar excludeSourceAndTargetMatch logic as in this PR (so that the SourceMatchers.Matches call is lazy and the TargetMatchers.Matches call is pre-computed)
$ benchstat full-bench-remove-store-updated.txt full-bench-patch.txt
goos: linux
goarch: amd64
pkg: github.com/prometheus/alertmanager/inhibit
cpu: AMD EPYC Processor (with IBPB)
                                                                 │ full-bench-remove-store-updated.txt │        full-bench-patch.txt         │
                                                                 │               sec/op                │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                                             1.712µ ±  6%   1.410µ ± 31%        ~ (p=0.065 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4                                           1.792µ ±  5%   1.325µ ± 18%  -26.04% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4                                          1.707µ ± 20%   1.401µ ± 11%  -17.96% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4                                         1.756µ ± 30%   1.433µ ± 29%  -18.42% (p=0.026 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4                                        1.925µ ± 53%   1.987µ ± 52%        ~ (p=0.699 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4                                           1.800µ ± 10%   1.594µ ± 55%        ~ (p=0.240 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4                                          2.232µ ± 29%   1.456µ ± 16%  -34.74% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4                                         1.724µ ± 27%   1.397µ ±  8%  -18.97% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4                                        1.991µ ± 34%   1.419µ ± 29%  -28.73% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-4                             2.418µ ±  8%   1.782µ ±  8%  -26.31% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-4                            6.822µ ± 13%   1.755µ ± 20%  -74.28% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-4                          53.252µ ±  7%   1.834µ ± 29%  -96.56% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-4                        635.912µ ± 24%   1.860µ ± 22%  -99.71% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4                                      1.686µ ±  6%   1.388µ ±  5%  -17.68% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4                                            2.736µ ± 17%   1.986µ ±  8%  -27.40% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4                                          13.307µ ±  7%   8.368µ ± 13%  -37.12% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4                                         123.30µ ±  8%   60.93µ ±  8%  -50.58% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4                                        1162.8µ ± 14%   616.1µ ±  6%  -47.02% (p=0.002 n=6)
geomean                                                                                   6.882µ         2.966µ        -56.90%

Spaceman1701 avatar Apr 07 '25 17:04 Spaceman1701

One last commit addressing benchmark performance for now - I found that the different in the _last_rule_matches performance between this PR and the souped-up main branch version is mostly calls to time.Now. The main implementation is calling time.Now much more frequently in that scenario.

I patched both that version and this PR to exit hasEqual and findInhibitor early if the InhibitRule cache is emtpy. Here's the new benchmark results for that set of patches:

                                                                 │ full-bench-remove-store-updated_v2.txt │       full-bench-patch_v2.txt       │
                                                                 │                 sec/op                 │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                                                1.874µ ± 14%   1.565µ ± 30%  -16.49% (p=0.041 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4                                              1.679µ ±  6%   1.946µ ± 33%        ~ (p=0.485 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4                                             1.646µ ± 17%   1.436µ ± 32%  -12.73% (p=0.037 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4                                            1.708µ ±  8%   1.423µ ± 55%        ~ (p=0.058 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4                                           1.800µ ± 22%   1.456µ ± 23%  -19.11% (p=0.015 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4                                              1.799µ ±  9%   1.528µ ±  8%  -15.09% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4                                             1.718µ ± 16%   1.437µ ±  7%  -16.36% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4                                            1.747µ ±  8%   1.569µ ±  9%  -10.19% (p=0.026 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4                                           2.009µ ± 15%   1.572µ ± 14%  -21.75% (p=0.004 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-4                                2.547µ ± 47%   1.886µ ± 19%  -25.95% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-4                               7.097µ ±  9%   1.901µ ± 12%  -73.22% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-4                             50.929µ ±  8%   1.931µ ± 15%  -96.21% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-4                           580.522µ ±  8%   1.969µ ± 12%  -99.66% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4                                         1.777µ ± 25%   1.404µ ± 11%  -21.02% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4                                               2.113µ ± 36%   1.743µ ± 26%  -17.49% (p=0.026 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4                                              6.084µ ±  5%   4.237µ ± 17%  -30.36% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4                                             46.20µ ± 14%   30.51µ ±  6%  -33.97% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4                                            416.1µ ± 31%   305.5µ ± 11%  -26.56% (p=0.002 n=6)
geomean                                                                                      5.701µ         2.716µ        -52.36%

The PR's implementation is still a bit faster in _last_rule_matches in this run. My profiler seems to think this is related to the time it takes to acquire the locks. I don't think we'd see any difference in the real world where there's contention for the lock.

Overall, this PR's implementation is faster than what's on main in basically all scenarios, and still orders of magnitude faster than the souped-up main algorithm in cases where each inhibit rule has a lot of source alerts.

Spaceman1701 avatar Apr 07 '25 19:04 Spaceman1701

And just for completeness, here's the comparison between the PR and a totally unpatched main branch:

                                                                 │ full-bench-main.txt │       full-bench-patch_v2.txt       │
                                                                 │       sec/op        │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                             1.811µ ± 14%   1.565µ ± 30%  -13.61% (p=0.041 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4                           1.690µ ±  4%   1.946µ ± 33%        ~ (p=0.589 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4                          1.770µ ± 28%   1.436µ ± 32%  -18.85% (p=0.037 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4                         1.828µ ±  6%   1.423µ ± 55%        ~ (p=0.065 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4                        1.848µ ± 55%   1.456µ ± 23%  -21.19% (p=0.004 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4                           1.960µ ± 22%   1.528µ ±  8%  -22.05% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4                          3.700µ ± 14%   1.437µ ±  7%  -61.18% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4                        25.883µ ± 10%   1.569µ ±  9%  -93.94% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4                      183.483µ ± 30%   1.572µ ± 14%  -99.14% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-4             2.526µ ±  8%   1.886µ ± 19%  -25.34% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-4            8.404µ ±  8%   1.901µ ± 12%  -77.39% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-4          72.286µ ±  9%   1.931µ ± 15%  -97.33% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-4        794.827µ ± 18%   1.969µ ± 12%  -99.75% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4                     21.776µ ±  8%   1.404µ ± 11%  -93.55% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4                            2.759µ ±  4%   1.743µ ± 26%  -36.82% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4                          14.541µ ±  8%   4.237µ ± 17%  -70.86% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4                         127.45µ ±  6%   30.51µ ±  6%  -76.06% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4                        1258.8µ ±  9%   305.5µ ± 11%  -75.73% (p=0.002 n=6)
geomean                                                                   12.97µ         2.716µ        -79.06%

Spaceman1701 avatar Apr 07 '25 21:04 Spaceman1701

Inhibition performance is significantly improved in #4607 It uses an index instead of a cache. See this comment on how the index works.

@Spaceman1701 can you rebase this implementation and check if the cache provides performance gains on top of the index and if it's still worth doing?

Edit: looking at the PR further, it seems to be a similar idea but implemented differently. So probably we can close this.

siavashs avatar Oct 22 '25 09:10 siavashs

Hey @siavashs , at first glance it looks like the implementation of the index is very similar to the implementation of the icache in this change. E.g. makeCacheKey here and fingerprintEquals are pretty much the same. I'm skimming #4607 here, but I think the primary difference is that this change has one datastructure that is the scache and the index. Internally, our version of alertmanager uses this so that it can return the fingerprints of all inhibiting alerts. The other minor differences in this change:

  • the matcher evaluations are cached for two sided matches
  • calls to time.Now are factored out of inner loops These make small improvements to performance. Given the similarities between #4607 and this change, it's probably easier to close this PR and implement those improvements ontop of what exists here. Unfortunately this isn't something I'll be able to do unless/if we're able to adopt the implementation that's been merged.

The one thing I don't understand is why the benchmarks in this PR seem better at a glance. It's possible this is down to the different machines we tested on.

Would you mind running the inhibitor benchmarks for this PR on your machine? If it still looks substantially different than the benchmarks for #4607, I can take a closer look to try to understand...

Spaceman1701 avatar Oct 22 '25 09:10 Spaceman1701

Here are the benchmarks I did for this branch (in case you want to compare with yours):

Details

~/D/a/alertmanager (icache-with-first-result|✔) $ go test -bench=. -run='^$' -count 7 -benchmem ./inhibit/ | tee benchmark-inhibit-cache.txt
goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/inhibit
cpu: Apple M3 Pro
BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-12         	 2549952	       429.3 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-12         	 2542627	       430.5 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-12         	 2352310	       435.1 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-12         	 2625428	       434.6 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-12         	 2573762	       428.5 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-12         	 2461554	       426.2 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1_inhibiting_alert-12         	 2399557	       428.8 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-12       	 2341170	       429.3 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-12       	 2583420	       427.0 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-12       	 2442073	       430.0 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-12       	 2506743	       426.9 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-12       	 2377144	       436.9 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-12       	 2359090	       441.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_1_inhibiting_alert-12       	 2466920	       436.7 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-12      	 2343828	       438.0 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-12      	 2412193	       434.1 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-12      	 2393886	       434.1 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-12      	 2749725	       443.5 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-12      	 2432582	       436.2 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-12      	 2458640	       432.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1_inhibiting_alert-12      	 2364182	       433.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-12     	 2381535	       469.3 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-12     	 2245339	       490.3 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-12     	 2288902	       466.3 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-12     	 2076728	       520.5 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-12     	 2091968	       539.7 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-12     	 2007397	       537.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_1_inhibiting_alert-12     	 1971786	       527.9 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-12    	 2130751	       590.9 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-12    	 1946528	       596.6 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-12    	 1963693	       571.6 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-12    	 2125946	       569.6 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-12    	 1999339	       578.6 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-12    	 2062027	       574.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_1_inhibiting_alert-12    	 1971615	       585.1 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-12       	 2241342	       476.6 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-12       	 2341010	       470.5 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-12       	 2333052	       468.1 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-12       	 2325962	       473.0 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-12       	 2372875	       465.1 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-12       	 2374576	       467.8 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts-12       	 2292949	       468.1 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-12      	 2446972	       467.3 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-12      	 2345157	       462.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-12      	 2564450	       468.2 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-12      	 2285212	       464.7 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-12      	 2402594	       471.7 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-12      	 2269624	       470.2 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts-12      	 2383495	       464.9 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-12     	 2263526	       473.1 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-12     	 2403828	       467.0 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-12     	 2276610	       471.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-12     	 2391637	       468.1 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-12     	 2182570	       471.5 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-12     	 2562704	       469.2 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts-12     	 2214241	       473.0 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-12    	 2218141	       473.9 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-12    	 2128590	       474.2 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-12    	 2177030	       472.6 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-12    	 2245520	       469.5 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-12    	 2256829	       476.1 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-12    	 2179735	       471.8 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts-12    	 2163909	       489.1 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-12         	 1875188	       599.7 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-12         	 1787878	       602.4 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-12         	 1911996	       602.1 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-12         	 1819026	       601.3 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-12         	 1801441	       599.9 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-12         	 1797752	       603.3 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-12         	 1826916	       607.0 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-12        	 1794582	       606.3 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-12        	 1865400	       605.4 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-12        	 1661011	       613.3 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-12        	 1829118	       604.7 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-12        	 1835413	       602.5 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-12        	 1772190	       599.5 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-12        	 1820858	       606.5 ns/op	     488 B/op	      10 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-12       	 1757442	       635.8 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-12       	 1733418	       612.2 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-12       	 1824864	       612.1 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-12       	 1822245	       609.4 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-12       	 1847073	       608.8 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-12       	 1786066	       609.8 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-12       	 1760792	       603.3 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-12      	 1750762	       604.6 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-12      	 1774530	       607.7 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-12      	 1750663	       606.0 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-12      	 1729974	       613.4 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-12      	 1734894	       610.4 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-12      	 1770200	       608.6 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-12      	 1775438	       606.9 ns/op	     496 B/op	      11 allocs/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-12                  	 2305063	       448.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-12                  	 2237691	       453.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-12                  	 2370349	       454.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-12                  	 2259199	       450.8 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-12                  	 2310182	       463.6 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-12                  	 2264394	       454.2 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-12                  	 2294893	       454.3 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-12                        	 1899793	       556.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-12                        	 1913986	       553.9 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-12                        	 1937922	       554.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-12                        	 1871246	       562.0 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-12                        	 1906359	       553.7 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-12                        	 1898174	       552.4 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10_inhibition_rules,_last_rule_matches-12                        	 2000439	       559.5 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-12                       	  720709	      1540 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-12                       	  697876	      1554 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-12                       	  713967	      1610 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-12                       	  685314	      1549 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-12                       	  718411	      1553 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-12                       	  692836	      1556 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-12                       	  717962	      1559 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-12                      	   87838	     12278 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-12                      	   96835	     12283 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-12                      	   92163	     12381 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-12                      	   94401	     13066 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-12                      	   92577	     13543 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-12                      	   91092	     12809 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-12                      	   92542	     14305 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-12                     	    8779	    116967 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-12                     	   10286	    121807 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-12                     	    8880	    123027 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-12                     	    9526	    126077 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-12                     	    8800	    126273 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-12                     	    9118	    134277 ns/op	     432 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-12                     	    8672	    124294 ns/op	     432 B/op	       8 allocs/op
PASS
ok  	github.com/prometheus/alertmanager/inhibit	831.430s

And comparison with the current main:

goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/inhibit
cpu: Apple M3 Pro
                                                                  │ benchmark-inhibit-main.txt │     benchmark-inhibit-cache.txt     │
                                                                  │           sec/op           │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-12                                    487.0n ±  1%   429.3n ±  1%  -11.85% (p=0.001 n=7)
Mutes/10_inhibition_rules,_1_inhibiting_alert-12                                  494.6n ±  1%   430.0n ±  3%  -13.06% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1_inhibiting_alert-12                                 511.8n ±  3%   434.1n ±  2%  -15.18% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-12                                586.2n ±  4%   520.5n ± 10%  -11.21% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-12                               656.4n ±  1%   578.6n ±  3%  -11.85% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-12                                  566.6n ±  2%   468.1n ±  2%  -17.38% (p=0.001 n=7)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-12                                 567.2n ±  2%   467.3n ±  1%  -17.61% (p=0.001 n=7)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-12                                566.7n ± 13%   471.4n ±  1%  -16.82% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-12                               564.2n ±  7%   473.9n ±  3%  -16.00% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-12                             544.9n ±  1%   454.2n ±  2%  -16.65% (p=0.001 n=7)
Mutes/10_inhibition_rules,_last_rule_matches-12                                  1058.0n ±  2%   554.4n ±  1%  -47.60% (p=0.001 n=7)
Mutes/100_inhibition_rules,_last_rule_matches-12                                  6.175µ ±  3%   1.554µ ±  4%  -74.83% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_last_rule_matches-12                                 60.00µ ±  4%   12.81µ ± 12%  -78.65% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_last_rule_matches-12                                606.3µ ±  3%   124.3µ ±  8%  -79.50% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-12                                   602.1n ±  1%
Mutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-12                                  605.4n ±  1%
Mutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-12                                 609.8n ±  4%
Mutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-12                                607.7n ±  1%
geomean                                                                           1.585µ         879.6n        -38.28%

                                                                  │ benchmark-inhibit-main.txt │    benchmark-inhibit-cache.txt    │
                                                                  │            B/op            │    B/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-12                                      488.0 ± 0%   432.0 ± 0%  -11.48% (p=0.001 n=7)
Mutes/10_inhibition_rules,_1_inhibiting_alert-12                                    488.0 ± 0%   432.0 ± 0%  -11.48% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1_inhibiting_alert-12                                   488.0 ± 0%   432.0 ± 0%  -11.48% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-12                                  488.0 ± 0%   432.0 ± 0%  -11.48% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-12                                 488.0 ± 0%   432.0 ± 0%  -11.48% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-12                                    488.0 ± 0%   432.0 ± 0%  -11.48% (p=0.001 n=7)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-12                                   488.0 ± 0%   432.0 ± 0%  -11.48% (p=0.001 n=7)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-12                                  488.0 ± 0%   432.0 ± 0%  -11.48% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-12                                 488.0 ± 0%   432.0 ± 0%  -11.48% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-12                               488.0 ± 0%   432.0 ± 0%  -11.48% (p=0.001 n=7)
Mutes/10_inhibition_rules,_last_rule_matches-12                                     472.0 ± 0%   432.0 ± 0%   -8.47% (p=0.001 n=7)
Mutes/100_inhibition_rules,_last_rule_matches-12                                    472.0 ± 0%   432.0 ± 0%   -8.47% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_last_rule_matches-12                                   472.0 ± 0%   432.0 ± 0%   -8.47% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_last_rule_matches-12                                  473.0 ± 0%   432.0 ± 0%   -8.67% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-12                                   488.0 ± 0%
Mutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-12                                  488.0 ± 0%
Mutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-12                                 496.0 ± 0%
Mutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-12                                496.0 ± 0%
geomean                                                                             483.4        444.7       -10.64%

                                                                  │ benchmark-inhibit-main.txt │    benchmark-inhibit-cache.txt    │
                                                                  │         allocs/op          │ allocs/op   vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-12                                     10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/10_inhibition_rules,_1_inhibiting_alert-12                                   10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1_inhibiting_alert-12                                  10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-12                                 10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-12                                10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-12                                   10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-12                                  10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-12                                 10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-12                                10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-12                              10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/10_inhibition_rules,_last_rule_matches-12                                    10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/100_inhibition_rules,_last_rule_matches-12                                   10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_last_rule_matches-12                                  10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_last_rule_matches-12                                 10.000 ± 0%   8.000 ± 0%  -20.00% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-12                                   10.00 ± 0%
Mutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-12                                  10.00 ± 0%
Mutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-12                                 11.00 ± 0%
Mutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-12                                11.00 ± 0%
geomean                                                                             10.00        8.496       -20.00%

There are performance improvements in some cases.

siavashs avatar Oct 22 '25 10:10 siavashs

interesting... I think some of this is from calling time.Now less, but I'm not sure what the source of per inhibition rule performance is. Iirc, none of the benchmarks pass excludeTwoSidedMatch, so that can't explain it. I believe this PR does 1 fewer map lookups, but I highly doubt that's measurable at this timescale.

I'm not sure if I'll have time to investigate this in depth for a while, unfortunately.

Spaceman1701 avatar Oct 23 '25 07:10 Spaceman1701

Ok, I had a few minutes to kick off some benchmarks in the background. Interestingly, I'm seeing an even bigger per-inhibit rule performance improvement. I really don't know what's causing this...

                                                                 │ main-with-index.txt │    icache-with-first-result.txt     │
                                                                 │       sec/op        │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                            3.726µ ±  35%   1.381µ ± 19%  -62.93% (p=0.002 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4                          3.359µ ±  16%   1.448µ ± 17%  -56.89% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4                         3.714µ ±  31%   1.431µ ± 10%  -61.48% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4                        3.696µ ±  22%   1.498µ ± 25%  -59.47% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4                       8.774µ ± 154%   1.484µ ± 56%  -83.09% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4                          3.954µ ±  20%   1.498µ ± 13%  -62.12% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4                         3.607µ ±   7%   1.552µ ± 35%  -56.97% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4                        3.716µ ±  13%   1.468µ ±  8%  -60.50% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4                       3.418µ ±  13%   1.506µ ± 28%  -55.94% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4                     3.669µ ±  25%   1.329µ ± 22%  -63.77% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4                          16.239µ ±   6%   1.664µ ±  6%  -89.76% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4                        147.144µ ±  18%   4.798µ ± 17%  -96.74% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4                       1449.51µ ±   7%   31.36µ ± 17%  -97.84% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4                      16880.0µ ±  43%   302.8µ ± 33%  -98.21% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-4                            1.793µ ±  8%
Mutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-4                           1.902µ ± 11%
Mutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-4                          1.781µ ± 18%
Mutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-4                         1.881µ ± 42%
geomean                                                                  15.76µ          2.635µ        -81.47%

I'll investigate... Either way, I think we should just build on what we've merged to main rather than try to rebase this onto main.

Spaceman1701 avatar Oct 29 '25 14:10 Spaceman1701

So... I think that most of the performance difference is actually coming from With calls on Prometheus vectors 😮. My profiler says about 50% of CPU time is spent on this. I'm going to try to refactor a bit so that we already have the references to the right metrics inside the hot path.

Spaceman1701 avatar Oct 29 '25 16:10 Spaceman1701

That seems to have been it - here's the preliminary benchmark after moving metric vector selection outside the hot path:

                                                     │ main-with-index.txt │   main-with-index-fix-metrics.txt   │
                                                     │       sec/op        │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                3.726µ ±  35%   2.736µ ± 40%  -26.55% (p=0.041 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4              3.359µ ±  16%   2.429µ ± 23%  -27.69% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4             3.714µ ±  31%   2.693µ ± 16%  -27.48% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4            3.696µ ±  22%   2.702µ ± 43%  -26.91% (p=0.026 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4           8.774µ ± 154%   4.248µ ± 26%  -51.58% (p=0.015 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4              3.954µ ±  20%   2.742µ ± 12%  -30.66% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4             3.607µ ±   7%   2.528µ ±  5%  -29.91% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4            3.716µ ±  13%   2.091µ ± 29%  -43.74% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4           3.418µ ±  13%   2.206µ ± 15%  -35.46% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4         3.669µ ±  25%   1.985µ ±  8%  -45.89% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4              16.239µ ±   6%   4.982µ ± 16%  -69.32% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4             147.14µ ±  18%   34.80µ ±  5%  -76.35% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4            1449.5µ ±   7%   335.6µ ±  5%  -76.85% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4           16.880m ±  43%   3.701m ±  8%  -78.08% (p=0.002 n=6)
geomean                                                      15.76µ          7.747µ        -50.85%

Spaceman1701 avatar Oct 29 '25 19:10 Spaceman1701

And a bit more improvement after consolidating some time.Now() calls:

                                                     │ main-with-index.txt │ main-with-index-fix-metrics-and-time.txt │
                                                     │       sec/op        │      sec/op        vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                3.726µ ±  35%        1.806µ ± 10%  -51.52% (p=0.002 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4              3.359µ ±  16%        1.894µ ± 25%  -43.63% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4             3.714µ ±  31%        1.922µ ± 13%  -48.24% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4            3.696µ ±  22%        2.071µ ± 20%  -43.95% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4           8.774µ ± 154%        2.856µ ± 26%  -67.45% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4              3.954µ ±  20%        1.877µ ± 43%  -52.54% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4             3.607µ ±   7%        2.134µ ± 18%  -40.84% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4            3.716µ ±  13%        1.912µ ± 36%  -48.55% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4           3.418µ ±  13%        2.047µ ± 16%  -40.11% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4         3.669µ ±  25%        1.860µ ±  6%  -49.30% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4              16.239µ ±   6%        4.400µ ± 10%  -72.91% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4             147.14µ ±  18%        27.75µ ±  6%  -81.14% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4            1449.5µ ±   7%        278.1µ ±  7%  -80.81% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4           16.880m ±  43%        2.888m ± 10%  -82.89% (p=0.002 n=6)
geomean                                                      15.76µ               6.152µ        -60.98%

It might improve things a little further if we can cache the excludeTwoSidedMatch, but I don't think much more improvement is possible without removing some of the new metrics.

For comparison, this is where we are compared to this PR:

                                                                 │ main-with-index-fix-metrics-and-time.txt │    icache-with-first-result.txt     │
                                                                 │                  sec/op                  │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4                                                  1.806µ ± 10%   1.381µ ± 19%  -23.53% (p=0.009 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4                                                1.894µ ± 25%   1.448µ ± 17%  -23.53% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4                                               1.922µ ± 13%   1.431µ ± 10%  -25.57% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4                                              2.071µ ± 20%   1.498µ ± 25%  -27.69% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4                                             2.856µ ± 26%   1.484µ ± 56%  -48.05% (p=0.004 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4                                                1.877µ ± 43%   1.498µ ± 13%  -20.20% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4                                               2.134µ ± 18%   1.552µ ± 35%  -27.27% (p=0.009 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4                                              1.912µ ± 36%   1.468µ ±  8%  -23.23% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4                                             2.047µ ± 16%   1.506µ ± 28%  -26.43% (p=0.015 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4                                           1.860µ ±  6%   1.329µ ± 22%  -28.55% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4                                                 4.400µ ± 10%   1.664µ ±  6%  -62.19% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4                                               27.751µ ±  6%   4.798µ ± 17%  -82.71% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4                                              278.10µ ±  7%   31.36µ ± 17%  -88.72% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4                                             2888.3µ ± 10%   302.8µ ± 33%  -89.52% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts_w/_last_match-4                                                 1.793µ ±  8%
Mutes/1_inhibition_rule,_100_inhibiting_alerts_w/_last_match-4                                                1.902µ ± 11%
Mutes/1_inhibition_rule,_1000_inhibiting_alerts_w/_last_match-4                                               1.781µ ± 18%
Mutes/1_inhibition_rule,_10000_inhibiting_alerts_w/_last_match-4                                              1.881µ ± 42%
geomean                                                                                        6.152µ         2.635µ        -52.52%

I think I might open a PR with the changes I have. I can iterate a little more there.

Spaceman1701 avatar Oct 30 '25 14:10 Spaceman1701

Close in favor of #4668

Spaceman1701 avatar Nov 03 '25 18:11 Spaceman1701