alertmanager icon indicating copy to clipboard operation
alertmanager copied to clipboard

Add Inhibitor.MutesAll()

Open Nuckal777 opened this issue 1 year ago • 8 comments

For the context please refer to #3932.

I added benchmarks, which consider more than one target alert. The following benchstat output compares:

  • invoking Inhibitor.Mutes() for each target alert (old.txt)
  • with the Inhibitor.MutesAll() implementation (new.txt). Inhibitor.MutesAll() increase the speed in certain cases at the cost of some memory.
goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/inhibit
                                                                        │    old.txt    │               new.txt                 │
                                                                        │    sec/op     │    sec/op     vs base                 │
Mutes/1_inhibition_rule,_1_inhibiting_alert,_1_target_alert-10              768.5n ± 3%    842.5n ± 5%    +9.62% (p=0.001 n=10)
Mutes/10_inhibition_rules,_1_inhibiting_alert,_1_target_alert-10            769.5n ± 2%    834.4n ± 0%    +8.44% (p=0.000 n=10)
Mutes/100_inhibition_rules,_1_inhibiting_alert,_1_target_alert-10           779.3n ± 1%    841.4n ± 1%    +7.96% (p=0.000 n=10)
Mutes/1000_inhibition_rules,_1_inhibiting_alert,_1_target_alert-10          839.5n ± 1%    914.7n ± 4%    +8.96% (p=0.000 n=10)
Mutes/1000_inhibition_rules,_100_inhibiting_alert,_100_target_alert-10     157.39µ ± 1%    54.41µ ± 1%   -65.43% (p=0.000 n=10)
Mutes/10000_inhibition_rules,_1_inhibiting_alert,_1_target_alert-10         897.5n ± 3%    962.6n ± 2%    +7.25% (p=0.000 n=10)
Mutes/10000_inhibition_rules,_100_inhibiting_alert,_100_target_alert-10     63.07µ ± 6%    57.59µ ± 3%    -8.68% (p=0.000 n=10)
Mutes/1_inhibition_rule,_10_inhibiting_alerts,_1_target_alert-10            886.9n ± 1%   1087.0n ± 1%   +22.57% (p=0.000 n=10)
Mutes/1_inhibition_rule,_100_inhibiting_alerts,_1_target_alert-10           1.929µ ± 2%    3.083µ ± 1%   +59.84% (p=0.000 n=10)
Mutes/1_inhibition_rule,_100_inhibiting_alerts,_100_target_alert-10        174.18µ ± 1%    60.44µ ± 1%   -65.30% (p=0.000 n=10)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts,_1_target_alert-10          12.67µ ± 1%    25.82µ ± 1%  +103.87% (p=0.000 n=10)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts,_1000_target_alert-10     12425.7µ ± 1%    582.3µ ± 1%   -95.31% (p=0.000 n=10)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts,_1_target_alert-10         104.4µ ± 1%    278.8µ ± 1%  +167.13% (p=0.000 n=10)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts,_1_target_alert-10       11.89µ ± 2%    24.69µ ± 0%  +107.65% (p=0.000 n=10)
Mutes/10_inhibition_rules,_last_rule_matches,_1_target_alert-10             1.012µ ± 3%    1.125µ ± 1%   +11.22% (p=0.000 n=10)
Mutes/100_inhibition_rules,_last_rule_matches,_1_target_alert-10            3.178µ ± 1%    3.872µ ± 1%   +21.86% (p=0.000 n=10)
Mutes/100_inhibition_rules,_last_rule_matches,_100_target_alerts-10         299.9µ ± 1%    227.9µ ± 1%   -24.00% (p=0.000 n=10)
Mutes/1000_inhibition_rules,_last_rule_matches,_1_target_alert-10           24.62µ ± 3%    31.57µ ± 1%   +28.24% (p=0.000 n=10)
Mutes/1000_inhibition_rules,_last_rule_matches,_1000_target_alert-10        24.29m ± 0%    18.27m ± 0%   -24.77% (p=0.000 n=10)
Mutes/10000_inhibition_rules,_last_rule_matches,_1_target_alert-10          239.6µ ± 6%    306.5µ ± 0%   +27.94% (p=0.000 n=10)
geomean                                                                     18.71µ         17.34µ         -7.36%

In cases where a low count of alerts is involved Inhibitor.MutesAll() is 2-3 times slower, due to the additional memory allocations. This should still be sufficiently fast given the small scale. When the count of target alerts is increased Inhibitor.MutesAll() becomes faster. Invoking it as part of GET /alerts still needs to be done.

Nuckal777 avatar Aug 01 '24 14:08 Nuckal777

Do you intend to call it from MuteStage as well as APIv2?

Yes.

The PR seems incomplete without it?

I thought I check the general interest for the topic before implementing everything. I will add it. Might take a few days.

Nuckal777 avatar Aug 22 '24 07:08 Nuckal777

I added MutesAll to the api and MuteStage code paths. The CI needs a re-run, network foo during setup.

Nuckal777 avatar Aug 27 '24 07:08 Nuckal777

Is it possible to optimize inhibition rules without changing the Mutes interface?

I implemented that. In the current state the algorithmic improvement can only be realised when all alerts are passed to MutesAll(). By moving setAlertStatus out of alertFilter the different API handlers can choose how to update it. For getAlertsHandler I used MutesAll() here.

Nuckal777 avatar Sep 04 '24 15:09 Nuckal777

Is it possible to optimize inhibition rules without changing the Mutes interface?

I implemented that. In the current state the algorithmic improvement can only be realised when all alerts are passed to MutesAll().

I looked again at the code, but I'm afraid I don't see where the optimization is implemented for Mutes()? Like you said, the performance optimization only works if you pass all alerts to MutesAll(), and there is no performance improvement if you only call Mutes()? Did I misunderstand?

My suggestion was to see if we can optimize the performance of Mutes without having to pass all alerts to MutesAll(). For example, what if the cache is persistent, like matcherCache for silences?

grobinson-grafana avatar Sep 04 '24 16:09 grobinson-grafana

there is no performance improvement if you only call Mutes()?

Exactly. The whole speed improvement comes from caching the source cache evaluation of InhibitRules across multiple alerts. In the context of inhibitions, that's the whole set of potential target alerts. For MutesAll() it roughly boils down to:

for _, r := range ih.rules {
    var alerts []*types.Alert // <- reused in the inner loop
    var scacheEval []bool // <- reused in the inner loop
    for i, lset := range lsets {
        // ...
        if inhibitedByFP, eq := r.hasEqualCached(lset, r.SourceMatchers.Matches(lset), alerts, scacheEval); eq {
            ih.marker.SetInhibited(fingerprints[i], inhibitedByFP.String())
        }
        // ...
    }
}

Just using Mutes() flips the two loops. That looks roughly like:

func someOtherFunc() {
    // ...
    for _, a := range alerts {
        inhibitor.Mutes(a.Labels) // <- evaluates the source cache for each a
    }
}

func (ih *Inhibitor) Mutes(lset model.LabelSet) bool {
    // ...
    for _, r := range ih.rules {
        // ...
        if inhibitedByFP, eq := r.hasEqual(lset, r.SourceMatchers.Matches(lset)); eq { // <- always evaluates the source cache matches
            ih.marker.SetInhibited(fp, inhibitedByFP.String())
            return true
        }
    }
}

Caching the source cache evaluations "requires" to pass a list of alerts. If the signature change/addition is undesirable, we can close this PR. It doesn't work without. Technically, the state MutesAll() caches can be externalized into a struct, which implements the Muter interface, e.g.:

type CachingInhibitor {
    base Inhibitor
    var alerts [][]*types.Alert // one cache for each inhibition rule
    var scacheEval [][]bool // one cache for each inhibition rule
}

Externalizing the state would cost more memory, likely requires locking and requires explicitly dropping these caches at some point (likely after calling Mutes() a few times).

My suggestion was to see if we can optimize the performance of Mutes without having to pass all alerts to MutesAll(). For example, what if the cache is persistent, like matcherCache for silences?

I expect something in that direction is possible as well, but it's something for different PR.

Nuckal777 avatar Sep 06 '24 08:09 Nuckal777

Hi! I know it's been a while, is this change something you're still interested in working on?

Inhibitor performance has also been improved a lot, so I'm not sure what the performance gains of this PR are anymore. If you still think it'd be useful (and have the time), the first step is probably to rebase and collect new benchmarks.

Spaceman1701 avatar Nov 12 '25 17:11 Spaceman1701

Hey, thanks for the information. We tried the latest release, but haven't seen a significant improvement unfortunately.

If you still think it'd be useful (and have the time), the first step is probably to rebase and collect new benchmarks.

I'll have to see, if I get around to it. 😢

Nuckal777 avatar Nov 21 '25 17:11 Nuckal777

Hey, thanks for the information. We tried the latest release, but haven't seen a significant improvement unfortunately.

Ah actually, these improvements aren't in a release quite yet. They should be included in v0.30 (which we're planning to cut in a few weeks).

Spaceman1701 avatar Nov 21 '25 17:11 Spaceman1701