Silences: Optimize incremental mutes queries via a silence version index
Currently, Mutes is only optimized for the case where the global silence state hasn't changed at all and our label set previously matched zero or only a few silences. As soon as even a single silence has been added to the state, the entire state has to be scanned to find any match.
Since we know that existing silences cannot start matching without being re-added to the state, and we already track the state version up to which we are "caught up", we can optimize the code path for when only a few silences have been added to the state by (a rough sketch follows this list):
- Creating an ordered index of versions and the silences that were added at each version
- Allowing queries for silences matching the filters that were added after a given version (QSince(version))
- During Mutes, running one query over the silences we already knew about and a second one scanning only the newly added silences
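A minimal, self-contained sketch of the idea in Go. All names here (versionEntry, versionIndex, since, mutesIncremental) are illustrative assumptions and silences are reduced to plain string IDs; the actual types and query filters in the silence package differ.

```go
// Sketch only: illustrative names and shapes, not the real implementation.
package main

import (
	"fmt"
	"sort"
)

// versionEntry records which silences were (re)added to the state at a
// given state version.
type versionEntry struct {
	version int
	ids     []string
}

// versionIndex keeps entries ordered by version, so a "since" query can
// binary-search for the first entry newer than the caller's last-seen
// version and scan only that tail instead of the whole state.
type versionIndex struct {
	entries []versionEntry
}

// add records the silences added at a new, strictly increasing version.
func (idx *versionIndex) add(version int, ids ...string) {
	idx.entries = append(idx.entries, versionEntry{version: version, ids: ids})
}

// since returns the silences added strictly after the given version,
// i.e. the rough equivalent of a QSince(version) filter.
func (idx *versionIndex) since(version int) []string {
	i := sort.Search(len(idx.entries), func(i int) bool {
		return idx.entries[i].version > version
	})
	var out []string
	for _, e := range idx.entries[i:] {
		out = append(out, e.ids...)
	}
	return out
}

// mutesIncremental recomputes the silences muting one label set: it
// re-checks only the silences that matched last time (they may have
// expired) plus the silences added since the version we were caught up to.
// (A real implementation would also deduplicate silences that were
// re-added after lastVersion.)
func mutesIncremental(idx *versionIndex, lastVersion int, lastMatching []string,
	stillMatches func(id string) bool) []string {
	var matching []string
	for _, id := range lastMatching {
		if stillMatches(id) {
			matching = append(matching, id)
		}
	}
	for _, id := range idx.since(lastVersion) {
		if stillMatches(id) {
			matching = append(matching, id)
		}
	}
	return matching
}

func main() {
	var idx versionIndex
	idx.add(1, "s1", "s2")
	idx.add(2, "s3")
	idx.add(7, "s4", "s5")

	// Pretend everything except s2 still matches our label set.
	matches := func(id string) bool { return id != "s2" }

	// We were caught up to version 2 and previously matched s1 and s2,
	// so only s4 and s5 have to be scanned on top of the cached matches.
	fmt.Println(mutesIncremental(&idx, 2, []string{"s1", "s2"}, matches))
	// Output: [s1 s4 s5]
}
```

The point is the tail scan in since: previously known matches are revalidated by ID and only silences added after lastVersion are visited instead of the entire state, which is why the MutesIncremental numbers below stay roughly flat as the base set grows.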
This trades some memory for greatly improved performance in the situation where we have a lot of silences and only a few have been added to the state since we last queried.
The GC keeps the state and indices up to date in place, and reallocates the index to shed memory once the set of silences has stably halved.
goos: linux
goarch: amd64
pkg: github.com/prometheus/alertmanager/silence
cpu: AMD EPYC Processor (with IBPB)
│ 0-qsince.txt │ 1-qsince.txt │
│ sec/op │ sec/op vs base │
Mutes/0_total,_0_matching 298.7n ± ∞ ¹ 289.4n ± ∞ ¹ ~ (p=0.651 n=5)
Mutes/1_total,_1_matching 1.365µ ± ∞ ¹ 1.406µ ± ∞ ¹ +3.00% (p=0.008 n=5)
Mutes/100_total,_10_matching 3.527µ ± ∞ ¹ 3.132µ ± ∞ ¹ -11.20% (p=0.008 n=5)
Mutes/1000_total,_1_matching 1.513µ ± ∞ ¹ 1.543µ ± ∞ ¹ +1.98% (p=0.008 n=5)
Mutes/1000_total,_10_matching 3.942µ ± ∞ ¹ 3.479µ ± ∞ ¹ -11.75% (p=0.008 n=5)
Mutes/1000_total,_100_matching 24.53µ ± ∞ ¹ 22.06µ ± ∞ ¹ -10.07% (p=0.008 n=5)
Mutes/10000_total,_0_matching 4.737µ ± ∞ ¹ 4.202µ ± ∞ ¹ -11.29% (p=0.008 n=5)
Mutes/10000_total,_10_matching 4.839µ ± ∞ ¹ 4.200µ ± ∞ ¹ -13.21% (p=0.008 n=5)
Mutes/10000_total,_1000_matching 353.9µ ± ∞ ¹ 324.9µ ± ∞ ¹ -8.20% (p=0.008 n=5)
MutesIncremental/1000_base_silences 331.8µ ± ∞ ¹ 319.2µ ± ∞ ¹ -3.80% (p=0.008 n=5)
MutesIncremental/3000_base_silences 512.3µ ± ∞ ¹ 307.5µ ± ∞ ¹ -39.98% (p=0.008 n=5)
MutesIncremental/7000_base_silences 690.8µ ± ∞ ¹ 296.9µ ± ∞ ¹ -57.03% (p=0.008 n=5)
MutesIncremental/10000_base_silences 950.6µ ± ∞ ¹ 283.3µ ± ∞ ¹ -70.20% (p=0.008 n=5)
Query/100_silences 24.80µ ± ∞ ¹ 24.19µ ± ∞ ¹ -2.46% (p=0.008 n=5)
Query/1000_silences 356.5µ ± ∞ ¹ 336.0µ ± ∞ ¹ -5.73% (p=0.008 n=5)
Query/10000_silences 10.441m ± ∞ ¹ 6.283m ± ∞ ¹ -39.82% (p=0.008 n=5)
QueryParallel/100_silences 23.12µ ± ∞ ¹ 22.83µ ± ∞ ¹ ~ (p=0.310 n=5)
QueryParallel/1000_silences 362.1µ ± ∞ ¹ 344.6µ ± ∞ ¹ -4.85% (p=0.008 n=5)
QueryParallel/10000_silences 10.047m ± ∞ ¹ 6.256m ± ∞ ¹ -37.73% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_10%_add_rate 971.6µ ± ∞ ¹ 843.5µ ± ∞ ¹ -13.18% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_1%_add_rate 419.5µ ± ∞ ¹ 396.9µ ± ∞ ¹ -5.38% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_0.1%_add_rate 380.2µ ± ∞ ¹ 354.8µ ± ∞ ¹ -6.68% (p=0.008 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_1%_add_rate 10.767m ± ∞ ¹ 6.532m ± ∞ ¹ -39.34% (p=0.008 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_0.1%_add_rate 11.072m ± ∞ ¹ 6.214m ± ∞ ¹ -43.87% (p=0.008 n=5)
MutesParallel/100_silences 21.22µ ± ∞ ¹ 21.53µ ± ∞ ¹ +1.48% (p=0.008 n=5)
MutesParallel/1000_silences 242.8µ ± ∞ ¹ 236.6µ ± ∞ ¹ -2.56% (p=0.008 n=5)
MutesParallel/10000_silences 3.883m ± ∞ ¹ 3.429m ± ∞ ¹ -11.67% (p=0.008 n=5)
geomean 126.3µ 101.5µ -19.60%
¹ need >= 6 samples for confidence interval at level 0.95
│ 0-qsince.txt │ 1-qsince.txt │
│ B/op │ B/op vs base │
Mutes/0_total,_0_matching 376.0 ± ∞ ¹ 376.0 ± ∞ ¹ ~ (p=1.000 n=5) ²
Mutes/1_total,_1_matching 1.219Ki ± ∞ ¹ 1.250Ki ± ∞ ¹ +2.56% (p=0.008 n=5)
Mutes/100_total,_10_matching 4.047Ki ± ∞ ¹ 3.727Ki ± ∞ ¹ -7.92% (p=0.008 n=5)
Mutes/1000_total,_1_matching 1.219Ki ± ∞ ¹ 1.250Ki ± ∞ ¹ +2.56% (p=0.008 n=5)
Mutes/1000_total,_10_matching 4.047Ki ± ∞ ¹ 3.727Ki ± ∞ ¹ -7.92% (p=0.008 n=5)
Mutes/1000_total,_100_matching 31.27Ki ± ∞ ¹ 29.18Ki ± ∞ ¹ -6.67% (p=0.008 n=5)
Mutes/10000_total,_0_matching 4.047Ki ± ∞ ¹ 3.727Ki ± ∞ ¹ -7.92% (p=0.008 n=5)
Mutes/10000_total,_10_matching 4.047Ki ± ∞ ¹ 3.727Ki ± ∞ ¹ -7.92% (p=0.008 n=5)
Mutes/10000_total,_1000_matching 287.6Ki ± ∞ ¹ 276.1Ki ± ∞ ¹ -3.98% (p=0.008 n=5)
MutesIncremental/1000_base_silences 72.03Ki ± ∞ ¹ 230.60Ki ± ∞ ¹ +220.15% (p=0.008 n=5)
MutesIncremental/3000_base_silences 72.30Ki ± ∞ ¹ 213.23Ki ± ∞ ¹ +194.91% (p=0.008 n=5)
MutesIncremental/7000_base_silences 36.60Ki ± ∞ ¹ 190.82Ki ± ∞ ¹ +421.37% (p=0.008 n=5)
MutesIncremental/10000_base_silences 24.44Ki ± ∞ ¹ 169.77Ki ± ∞ ¹ +594.66% (p=0.008 n=5)
Query/100_silences 2.711Ki ± ∞ ¹ 2.984Ki ± ∞ ¹ +10.09% (p=0.008 n=5)
Query/1000_silences 22.87Ki ± ∞ ¹ 22.39Ki ± ∞ ¹ -2.08% (p=0.008 n=5)
Query/10000_silences 220.7Ki ± ∞ ¹ 220.2Ki ± ∞ ¹ -0.22% (p=0.008 n=5)
QueryParallel/100_silences 2.641Ki ± ∞ ¹ 2.914Ki ± ∞ ¹ +10.36% (p=0.008 n=5)
QueryParallel/1000_silences 22.80Ki ± ∞ ¹ 22.32Ki ± ∞ ¹ -2.09% (p=0.008 n=5)
QueryParallel/10000_silences 220.6Ki ± ∞ ¹ 220.1Ki ± ∞ ¹ -0.22% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_10%_add_rate 161.1Ki ± ∞ ¹ 176.0Ki ± ∞ ¹ +9.28% (p=0.032 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_1%_add_rate 38.07Ki ± ∞ ¹ 39.01Ki ± ∞ ¹ +2.46% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_0.1%_add_rate 24.22Ki ± ∞ ¹ 23.94Ki ± ∞ ¹ -1.13% (p=0.008 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_1%_add_rate 218.6Ki ± ∞ ¹ 218.7Ki ± ∞ ¹ ~ (p=0.310 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_0.1%_add_rate 219.9Ki ± ∞ ¹ 219.8Ki ± ∞ ¹ -0.06% (p=0.008 n=5)
MutesParallel/100_silences 31.27Ki ± ∞ ¹ 29.18Ki ± ∞ ¹ -6.67% (p=0.008 n=5)
MutesParallel/1000_silences 287.6Ki ± ∞ ¹ 276.1Ki ± ∞ ¹ -3.98% (p=0.008 n=5)
MutesParallel/10000_silences 3.228Mi ± ∞ ¹ 2.688Mi ± ∞ ¹ -16.73% (p=0.008 n=5)
geomean 26.38Ki 32.22Ki +22.15%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
There is an increase in memory usage in MutesIncremental. We could mitigate it in the future by allowing QSince and QIDs filters in the same query, or by adding query types that return IDs instead of silence lists. The cost is still worth it: it only appears under massive silence loads that would make today's Alertmanager fall over anyway, and it is small compared to the memory available on current servers.
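Purely to illustrate the second mitigation idea, an ID-returning query variant might look roughly like the following; these names are hypothetical and do not exist in the silence package today:

```go
// Hypothetical shape only: none of these names exist in the silence
// package today; they just illustrate the mitigation idea.
package silencesketch

// queryParam stands in for the silence package's query filters
// (matchers, time bounds, and a QSince-style "added after version" filter).
type queryParam func() error

// idQuerier is an ID-returning variant of the query path: it never
// materializes full silence protos, so results that the incremental Mutes
// path only needs to track by ID stop costing an allocation per silence.
type idQuerier interface {
	queryIDs(filters ...queryParam) (ids []string, version int, err error)
}
```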
│ 0-qsince.txt │ 1-qsince.txt │
│ allocs/op │ allocs/op vs base │
Mutes/0_total,_0_matching 4.000 ± ∞ ¹ 4.000 ± ∞ ¹ ~ (p=1.000 n=5) ²
Mutes/1_total,_1_matching 19.00 ± ∞ ¹ 20.00 ± ∞ ¹ +5.26% (p=0.008 n=5)
Mutes/100_total,_10_matching 37.00 ± ∞ ¹ 30.00 ± ∞ ¹ -18.92% (p=0.008 n=5)
Mutes/1000_total,_1_matching 19.00 ± ∞ ¹ 20.00 ± ∞ ¹ +5.26% (p=0.008 n=5)
Mutes/1000_total,_10_matching 37.00 ± ∞ ¹ 30.00 ± ∞ ¹ -18.92% (p=0.008 n=5)
Mutes/1000_total,_100_matching 133.0 ± ∞ ¹ 120.0 ± ∞ ¹ -9.77% (p=0.008 n=5)
Mutes/10000_total,_0_matching 37.00 ± ∞ ¹ 30.00 ± ∞ ¹ -18.92% (p=0.008 n=5)
Mutes/10000_total,_10_matching 37.00 ± ∞ ¹ 30.00 ± ∞ ¹ -18.92% (p=0.008 n=5)
Mutes/10000_total,_1000_matching 1.041k ± ∞ ¹ 1.022k ± ∞ ¹ -1.83% (p=0.008 n=5)
MutesIncremental/1000_base_silences 283.0 ± ∞ ¹ 853.0 ± ∞ ¹ +201.41% (p=0.008 n=5)
MutesIncremental/3000_base_silences 284.0 ± ∞ ¹ 791.0 ± ∞ ¹ +178.52% (p=0.008 n=5)
MutesIncremental/7000_base_silences 156.0 ± ∞ ¹ 710.0 ± ∞ ¹ +355.13% (p=0.008 n=5)
MutesIncremental/10000_base_silences 112.0 ± ∞ ¹ 633.0 ± ∞ ¹ +465.18% (p=0.008 n=5)
Query/100_silences 27.00 ± ∞ ¹ 23.00 ± ∞ ¹ -14.81% (p=0.008 n=5)
Query/1000_silences 120.0 ± ∞ ¹ 114.0 ± ∞ ¹ -5.00% (p=0.008 n=5)
Query/10000_silences 1.023k ± ∞ ¹ 1.017k ± ∞ ¹ -0.59% (p=0.008 n=5)
QueryParallel/100_silences 24.00 ± ∞ ¹ 20.00 ± ∞ ¹ -16.67% (p=0.008 n=5)
QueryParallel/1000_silences 117.0 ± ∞ ¹ 111.0 ± ∞ ¹ -5.13% (p=0.008 n=5)
QueryParallel/10000_silences 1.020k ± ∞ ¹ 1.014k ± ∞ ¹ -0.59% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_10%_add_rate 748.0 ± ∞ ¹ 818.0 ± ∞ ¹ +9.36% (p=0.032 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_1%_add_rate 183.0 ± ∞ ¹ 184.0 ± ∞ ¹ ~ (p=0.405 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_0.1%_add_rate 124.0 ± ∞ ¹ 119.0 ± ∞ ¹ -4.03% (p=0.008 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_1%_add_rate 1.012k ± ∞ ¹ 1.009k ± ∞ ¹ -0.30% (p=0.008 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_0.1%_add_rate 1.017k ± ∞ ¹ 1.013k ± ∞ ¹ -0.39% (p=0.008 n=5)
MutesParallel/100_silences 133.0 ± ∞ ¹ 120.0 ± ∞ ¹ -9.77% (p=0.008 n=5)
MutesParallel/1000_silences 1.041k ± ∞ ¹ 1.022k ± ∞ ¹ -1.83% (p=0.008 n=5)
MutesParallel/10000_silences 10.05k ± ∞ ¹ 10.02k ± ∞ ¹ -0.33% (p=0.008 n=5)
geomean 154.3 178.8 +15.89%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
GC is not substantially slower, and only uses more memory when it has to release memory after the silence set shrinks by more than 50%, at the cost of a single extra allocation in that case.
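A hedged sketch of that shrink rule, assuming the 50% threshold described above; the helper name and placement are illustrative, not the actual implementation:

```go
// Sketch only: the 50% threshold and names are illustrative assumptions.
package main

import "fmt"

// maybeShrink is meant to be called after GC has removed expired entries in
// place. If the live entries fill less than half of the backing array, it
// reallocates once at the smaller size so memory is actually released rather
// than staying reserved; otherwise the existing slice is kept as is.
func maybeShrink[T any](entries []T) []T {
	if cap(entries) >= 2 && len(entries) < cap(entries)/2 {
		shrunk := make([]T, len(entries)) // the single extra allocation seen in the GC benchmark below
		copy(shrunk, entries)
		return shrunk
	}
	return entries
}

func main() {
	idx := make([]string, 0, 1000)
	idx = append(idx, "a", "b", "c") // after GC, only a few live entries remain
	idx = maybeShrink(idx)
	fmt.Println(len(idx), cap(idx)) // 3 3: the large backing array is dropped
}
```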
goos: linux
goarch: amd64
pkg: github.com/prometheus/alertmanager/silence
cpu: AMD EPYC Processor (with IBPB)
│ gc0.txt │ gc1.txt │
│ sec/op │ sec/op vs base │
GC/100_silences,_0%_expired 10.10µ ± ∞ ¹ 11.52µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/100_silences,_10%_expired 11.48µ ± ∞ ¹ 12.60µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/100_silences,_50%_expired 15.09µ ± ∞ ¹ 15.89µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/100_silences,_90%_expired 18.08µ ± ∞ ¹ 18.78µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/1000_silences,_0%_expired 66.35µ ± ∞ ¹ 81.53µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/1000_silences,_10%_expired 79.73µ ± ∞ ¹ 92.18µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/1000_silences,_50%_expired 121.3µ ± ∞ ¹ 129.0µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/1000_silences,_90%_expired 156.6µ ± ∞ ¹ 166.2µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/10000_silences,_0%_expired 649.0µ ± ∞ ¹ 1037.9µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/10000_silences,_10%_expired 823.0µ ± ∞ ¹ 1193.3µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/10000_silences,_50%_expired 1.383m ± ∞ ¹ 1.623m ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/10000_silences,_90%_expired 1.817m ± ∞ ¹ 2.103m ± ∞ ¹ ~ (p=0.333 n=2) ²
geomean 112.8µ 132.6µ +17.49%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ gc0.txt │ gc1.txt │
│ B/op │ B/op vs base │
GC/100_silences,_0%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_10%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_50%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_90%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_0%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_10%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_50%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_90%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_0%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_10%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_50%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_90%_expired 1.109Ki ± ∞ ¹ 121.109Ki ± ∞ ¹ ~ (p=0.333 n=2) ³
geomean 1.109Ki 1.640Ki +47.86%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ need >= 4 samples to detect a difference at alpha level 0.05
│ gc0.txt │ gc1.txt │
│ allocs/op │ allocs/op vs base │
GC/100_silences,_0%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_10%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_50%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_90%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_0%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_10%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_50%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_90%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_0%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_10%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_50%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_90%_expired 18.00 ± ∞ ¹ 19.00 ± ∞ ¹ ~ (p=0.333 n=2) ³
geomean 18.00 18.08 +0.45%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ need >= 4 samples to detect a difference at alpha level 0.05
This needs a rebase.