Silences: Optimize incremental mutes queries via a silence version index
Currently, Mutes is only optimized for the case where the global silence state hasn't changed at all and our label set previously matched zero or only a few silences. As soon as even a single silence has been added to the state, the entire state has to be scanned to find any match.
Since we know that existing silences cannot start matching without being re-added to the state, and we already track the state version up to which we are "caught up", we can optimize the code path for when only a few silences have been added to the state by (a rough sketch follows this list):
- Creating an ordered index of versions and the silences that were added at each version
- Allowing queries for silences matching the filters that were added after a given version (QSince(version))
- During Mutes, running one query over the silences we already knew about and a second one scanning only the newly added silences
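A minimal, self-contained sketch of the idea in Go. All names here (versionEntry, versionIndex, since, mutesIncremental) are illustrative assumptions and silences are reduced to plain string IDs; the actual types and query filters in the silence package differ.

```go
// Sketch only: illustrative names and shapes, not the real implementation.
package main

import (
	"fmt"
	"sort"
)

// versionEntry records which silences were (re)added to the state at a
// given state version.
type versionEntry struct {
	version int
	ids     []string
}

// versionIndex keeps entries ordered by version, so a "since" query can
// binary-search for the first entry newer than the caller's last-seen
// version and scan only that tail instead of the whole state.
type versionIndex struct {
	entries []versionEntry
}

// add records the silences added at a new, strictly increasing version.
func (idx *versionIndex) add(version int, ids ...string) {
	idx.entries = append(idx.entries, versionEntry{version: version, ids: ids})
}

// since returns the silences added strictly after the given version,
// i.e. the rough equivalent of a QSince(version) filter.
func (idx *versionIndex) since(version int) []string {
	i := sort.Search(len(idx.entries), func(i int) bool {
		return idx.entries[i].version > version
	})
	var out []string
	for _, e := range idx.entries[i:] {
		out = append(out, e.ids...)
	}
	return out
}

// mutesIncremental recomputes the silences muting one label set: it
// re-checks only the silences that matched last time (they may have
// expired) plus the silences added since the version we were caught up to.
// (A real implementation would also deduplicate silences that were
// re-added after lastVersion.)
func mutesIncremental(idx *versionIndex, lastVersion int, lastMatching []string,
	stillMatches func(id string) bool) []string {
	var matching []string
	for _, id := range lastMatching {
		if stillMatches(id) {
			matching = append(matching, id)
		}
	}
	for _, id := range idx.since(lastVersion) {
		if stillMatches(id) {
			matching = append(matching, id)
		}
	}
	return matching
}

func main() {
	var idx versionIndex
	idx.add(1, "s1", "s2")
	idx.add(2, "s3")
	idx.add(7, "s4", "s5")

	// Pretend everything except s2 still matches our label set.
	matches := func(id string) bool { return id != "s2" }

	// We were caught up to version 2 and previously matched s1 and s2,
	// so only s4 and s5 have to be scanned on top of the cached matches.
	fmt.Println(mutesIncremental(&idx, 2, []string{"s1", "s2"}, matches))
	// Output: [s1 s4 s5]
}
```

The point is the tail scan in since: previously known matches are revalidated by ID and only silences added after lastVersion are visited instead of the entire state, which is why the MutesIncremental numbers below stay roughly flat as the base set grows.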
This trades some memory for greatly improved performance in the situation where we have a lot of silences and only a few have been added to the state since we last queried.
The GC keeps the state and indices up to date in place, and reallocates the index to shed memory once the set of silences has stably halved.
goos: linux
goarch: amd64
pkg: github.com/prometheus/alertmanager/silence
cpu: AMD EPYC Processor (with IBPB)
│ 0-qsince.txt │ 1-qsince.txt │
│ sec/op │ sec/op vs base │
Mutes/0_total,_0_matching 298.7n ± ∞ ¹ 289.4n ± ∞ ¹ ~ (p=0.651 n=5)
Mutes/1_total,_1_matching 1.365µ ± ∞ ¹ 1.406µ ± ∞ ¹ +3.00% (p=0.008 n=5)
Mutes/100_total,_10_matching 3.527µ ± ∞ ¹ 3.132µ ± ∞ ¹ -11.20% (p=0.008 n=5)
Mutes/1000_total,_1_matching 1.513µ ± ∞ ¹ 1.543µ ± ∞ ¹ +1.98% (p=0.008 n=5)
Mutes/1000_total,_10_matching 3.942µ ± ∞ ¹ 3.479µ ± ∞ ¹ -11.75% (p=0.008 n=5)
Mutes/1000_total,_100_matching 24.53µ ± ∞ ¹ 22.06µ ± ∞ ¹ -10.07% (p=0.008 n=5)
Mutes/10000_total,_0_matching 4.737µ ± ∞ ¹ 4.202µ ± ∞ ¹ -11.29% (p=0.008 n=5)
Mutes/10000_total,_10_matching 4.839µ ± ∞ ¹ 4.200µ ± ∞ ¹ -13.21% (p=0.008 n=5)
Mutes/10000_total,_1000_matching 353.9µ ± ∞ ¹ 324.9µ ± ∞ ¹ -8.20% (p=0.008 n=5)
MutesIncremental/1000_base_silences 331.8µ ± ∞ ¹ 319.2µ ± ∞ ¹ -3.80% (p=0.008 n=5)
MutesIncremental/3000_base_silences 512.3µ ± ∞ ¹ 307.5µ ± ∞ ¹ -39.98% (p=0.008 n=5)
MutesIncremental/7000_base_silences 690.8µ ± ∞ ¹ 296.9µ ± ∞ ¹ -57.03% (p=0.008 n=5)
MutesIncremental/10000_base_silences 950.6µ ± ∞ ¹ 283.3µ ± ∞ ¹ -70.20% (p=0.008 n=5)
Query/100_silences 24.80µ ± ∞ ¹ 24.19µ ± ∞ ¹ -2.46% (p=0.008 n=5)
Query/1000_silences 356.5µ ± ∞ ¹ 336.0µ ± ∞ ¹ -5.73% (p=0.008 n=5)
Query/10000_silences 10.441m ± ∞ ¹ 6.283m ± ∞ ¹ -39.82% (p=0.008 n=5)
QueryParallel/100_silences 23.12µ ± ∞ ¹ 22.83µ ± ∞ ¹ ~ (p=0.310 n=5)
QueryParallel/1000_silences 362.1µ ± ∞ ¹ 344.6µ ± ∞ ¹ -4.85% (p=0.008 n=5)
QueryParallel/10000_silences 10.047m ± ∞ ¹ 6.256m ± ∞ ¹ -37.73% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_10%_add_rate 971.6µ ± ∞ ¹ 843.5µ ± ∞ ¹ -13.18% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_1%_add_rate 419.5µ ± ∞ ¹ 396.9µ ± ∞ ¹ -5.38% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_0.1%_add_rate 380.2µ ± ∞ ¹ 354.8µ ± ∞ ¹ -6.68% (p=0.008 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_1%_add_rate 10.767m ± ∞ ¹ 6.532m ± ∞ ¹ -39.34% (p=0.008 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_0.1%_add_rate 11.072m ± ∞ ¹ 6.214m ± ∞ ¹ -43.87% (p=0.008 n=5)
MutesParallel/100_silences 21.22µ ± ∞ ¹ 21.53µ ± ∞ ¹ +1.48% (p=0.008 n=5)
MutesParallel/1000_silences 242.8µ ± ∞ ¹ 236.6µ ± ∞ ¹ -2.56% (p=0.008 n=5)
MutesParallel/10000_silences 3.883m ± ∞ ¹ 3.429m ± ∞ ¹ -11.67% (p=0.008 n=5)
geomean 126.3µ 101.5µ -19.60%
¹ need >= 6 samples for confidence interval at level 0.95
│ 0-qsince.txt │ 1-qsince.txt │
│ B/op │ B/op vs base │
Mutes/0_total,_0_matching 376.0 ± ∞ ¹ 376.0 ± ∞ ¹ ~ (p=1.000 n=5) ²
Mutes/1_total,_1_matching 1.219Ki ± ∞ ¹ 1.250Ki ± ∞ ¹ +2.56% (p=0.008 n=5)
Mutes/100_total,_10_matching 4.047Ki ± ∞ ¹ 3.727Ki ± ∞ ¹ -7.92% (p=0.008 n=5)
Mutes/1000_total,_1_matching 1.219Ki ± ∞ ¹ 1.250Ki ± ∞ ¹ +2.56% (p=0.008 n=5)
Mutes/1000_total,_10_matching 4.047Ki ± ∞ ¹ 3.727Ki ± ∞ ¹ -7.92% (p=0.008 n=5)
Mutes/1000_total,_100_matching 31.27Ki ± ∞ ¹ 29.18Ki ± ∞ ¹ -6.67% (p=0.008 n=5)
Mutes/10000_total,_0_matching 4.047Ki ± ∞ ¹ 3.727Ki ± ∞ ¹ -7.92% (p=0.008 n=5)
Mutes/10000_total,_10_matching 4.047Ki ± ∞ ¹ 3.727Ki ± ∞ ¹ -7.92% (p=0.008 n=5)
Mutes/10000_total,_1000_matching 287.6Ki ± ∞ ¹ 276.1Ki ± ∞ ¹ -3.98% (p=0.008 n=5)
MutesIncremental/1000_base_silences 72.03Ki ± ∞ ¹ 230.60Ki ± ∞ ¹ +220.15% (p=0.008 n=5)
MutesIncremental/3000_base_silences 72.30Ki ± ∞ ¹ 213.23Ki ± ∞ ¹ +194.91% (p=0.008 n=5)
MutesIncremental/7000_base_silences 36.60Ki ± ∞ ¹ 190.82Ki ± ∞ ¹ +421.37% (p=0.008 n=5)
MutesIncremental/10000_base_silences 24.44Ki ± ∞ ¹ 169.77Ki ± ∞ ¹ +594.66% (p=0.008 n=5)
Query/100_silences 2.711Ki ± ∞ ¹ 2.984Ki ± ∞ ¹ +10.09% (p=0.008 n=5)
Query/1000_silences 22.87Ki ± ∞ ¹ 22.39Ki ± ∞ ¹ -2.08% (p=0.008 n=5)
Query/10000_silences 220.7Ki ± ∞ ¹ 220.2Ki ± ∞ ¹ -0.22% (p=0.008 n=5)
QueryParallel/100_silences 2.641Ki ± ∞ ¹ 2.914Ki ± ∞ ¹ +10.36% (p=0.008 n=5)
QueryParallel/1000_silences 22.80Ki ± ∞ ¹ 22.32Ki ± ∞ ¹ -2.09% (p=0.008 n=5)
QueryParallel/10000_silences 220.6Ki ± ∞ ¹ 220.1Ki ± ∞ ¹ -0.22% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_10%_add_rate 161.1Ki ± ∞ ¹ 176.0Ki ± ∞ ¹ +9.28% (p=0.032 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_1%_add_rate 38.07Ki ± ∞ ¹ 39.01Ki ± ∞ ¹ +2.46% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_0.1%_add_rate 24.22Ki ± ∞ ¹ 23.94Ki ± ∞ ¹ -1.13% (p=0.008 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_1%_add_rate 218.6Ki ± ∞ ¹ 218.7Ki ± ∞ ¹ ~ (p=0.310 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_0.1%_add_rate 219.9Ki ± ∞ ¹ 219.8Ki ± ∞ ¹ -0.06% (p=0.008 n=5)
MutesParallel/100_silences 31.27Ki ± ∞ ¹ 29.18Ki ± ∞ ¹ -6.67% (p=0.008 n=5)
MutesParallel/1000_silences 287.6Ki ± ∞ ¹ 276.1Ki ± ∞ ¹ -3.98% (p=0.008 n=5)
MutesParallel/10000_silences 3.228Mi ± ∞ ¹ 2.688Mi ± ∞ ¹ -16.73% (p=0.008 n=5)
geomean 26.38Ki 32.22Ki +22.15%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
There is an increase in memory usage in MutesIncremental. We could mitigate it in the future by allowing QSince and QIDs filters in the same query, or by adding query types that return IDs instead of silence lists. The cost is still worth it: it only appears under massive silence loads that would make today's Alertmanager fall over anyway, and it is small compared to the memory available on current servers.
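Purely to illustrate the second mitigation idea, an ID-returning query variant might look roughly like the following; these names are hypothetical and do not exist in the silence package today:

```go
// Hypothetical shape only: none of these names exist in the silence
// package today; they just illustrate the mitigation idea.
package silencesketch

// queryParam stands in for the silence package's query filters
// (matchers, time bounds, and a QSince-style "added after version" filter).
type queryParam func() error

// idQuerier is an ID-returning variant of the query path: it never
// materializes full silence protos, so results that the incremental Mutes
// path only needs to track by ID stop costing an allocation per silence.
type idQuerier interface {
	queryIDs(filters ...queryParam) (ids []string, version int, err error)
}
```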
│ 0-qsince.txt │ 1-qsince.txt │
│ allocs/op │ allocs/op vs base │
Mutes/0_total,_0_matching 4.000 ± ∞ ¹ 4.000 ± ∞ ¹ ~ (p=1.000 n=5) ²
Mutes/1_total,_1_matching 19.00 ± ∞ ¹ 20.00 ± ∞ ¹ +5.26% (p=0.008 n=5)
Mutes/100_total,_10_matching 37.00 ± ∞ ¹ 30.00 ± ∞ ¹ -18.92% (p=0.008 n=5)
Mutes/1000_total,_1_matching 19.00 ± ∞ ¹ 20.00 ± ∞ ¹ +5.26% (p=0.008 n=5)
Mutes/1000_total,_10_matching 37.00 ± ∞ ¹ 30.00 ± ∞ ¹ -18.92% (p=0.008 n=5)
Mutes/1000_total,_100_matching 133.0 ± ∞ ¹ 120.0 ± ∞ ¹ -9.77% (p=0.008 n=5)
Mutes/10000_total,_0_matching 37.00 ± ∞ ¹ 30.00 ± ∞ ¹ -18.92% (p=0.008 n=5)
Mutes/10000_total,_10_matching 37.00 ± ∞ ¹ 30.00 ± ∞ ¹ -18.92% (p=0.008 n=5)
Mutes/10000_total,_1000_matching 1.041k ± ∞ ¹ 1.022k ± ∞ ¹ -1.83% (p=0.008 n=5)
MutesIncremental/1000_base_silences 283.0 ± ∞ ¹ 853.0 ± ∞ ¹ +201.41% (p=0.008 n=5)
MutesIncremental/3000_base_silences 284.0 ± ∞ ¹ 791.0 ± ∞ ¹ +178.52% (p=0.008 n=5)
MutesIncremental/7000_base_silences 156.0 ± ∞ ¹ 710.0 ± ∞ ¹ +355.13% (p=0.008 n=5)
MutesIncremental/10000_base_silences 112.0 ± ∞ ¹ 633.0 ± ∞ ¹ +465.18% (p=0.008 n=5)
Query/100_silences 27.00 ± ∞ ¹ 23.00 ± ∞ ¹ -14.81% (p=0.008 n=5)
Query/1000_silences 120.0 ± ∞ ¹ 114.0 ± ∞ ¹ -5.00% (p=0.008 n=5)
Query/10000_silences 1.023k ± ∞ ¹ 1.017k ± ∞ ¹ -0.59% (p=0.008 n=5)
QueryParallel/100_silences 24.00 ± ∞ ¹ 20.00 ± ∞ ¹ -16.67% (p=0.008 n=5)
QueryParallel/1000_silences 117.0 ± ∞ ¹ 111.0 ± ∞ ¹ -5.13% (p=0.008 n=5)
QueryParallel/10000_silences 1.020k ± ∞ ¹ 1.014k ± ∞ ¹ -0.59% (p=0.008 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_10%_add_rate 748.0 ± ∞ ¹ 818.0 ± ∞ ¹ +9.36% (p=0.032 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_1%_add_rate 183.0 ± ∞ ¹ 184.0 ± ∞ ¹ ~ (p=0.405 n=5)
QueryWithConcurrentAdds/1000_initial_silences,_0.1%_add_rate 124.0 ± ∞ ¹ 119.0 ± ∞ ¹ -4.03% (p=0.008 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_1%_add_rate 1.012k ± ∞ ¹ 1.009k ± ∞ ¹ -0.30% (p=0.008 n=5)
QueryWithConcurrentAdds/10000_initial_silences,_0.1%_add_rate 1.017k ± ∞ ¹ 1.013k ± ∞ ¹ -0.39% (p=0.008 n=5)
MutesParallel/100_silences 133.0 ± ∞ ¹ 120.0 ± ∞ ¹ -9.77% (p=0.008 n=5)
MutesParallel/1000_silences 1.041k ± ∞ ¹ 1.022k ± ∞ ¹ -1.83% (p=0.008 n=5)
MutesParallel/10000_silences 10.05k ± ∞ ¹ 10.02k ± ∞ ¹ -0.33% (p=0.008 n=5)
geomean 154.3 178.8 +15.89%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
GC is not substantially slower, and only uses more memory when it has to release memory after the silence set shrinks by more than 50%, at the cost of a single extra allocation in that case.
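A hedged sketch of that shrink rule, assuming the 50% threshold described above; the helper name and placement are illustrative, not the actual implementation:

```go
// Sketch only: the 50% threshold and names are illustrative assumptions.
package main

import "fmt"

// maybeShrink is meant to be called after GC has removed expired entries in
// place. If the live entries fill less than half of the backing array, it
// reallocates once at the smaller size so memory is actually released rather
// than staying reserved; otherwise the existing slice is kept as is.
func maybeShrink[T any](entries []T) []T {
	if cap(entries) >= 2 && len(entries) < cap(entries)/2 {
		shrunk := make([]T, len(entries)) // the single extra allocation seen in the GC benchmark below
		copy(shrunk, entries)
		return shrunk
	}
	return entries
}

func main() {
	idx := make([]string, 0, 1000)
	idx = append(idx, "a", "b", "c") // after GC, only a few live entries remain
	idx = maybeShrink(idx)
	fmt.Println(len(idx), cap(idx)) // 3 3: the large backing array is dropped
}
```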
goos: linux
goarch: amd64
pkg: github.com/prometheus/alertmanager/silence
cpu: AMD EPYC Processor (with IBPB)
│ gc0.txt │ gc1.txt │
│ sec/op │ sec/op vs base │
GC/100_silences,_0%_expired 10.10µ ± ∞ ¹ 11.52µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/100_silences,_10%_expired 11.48µ ± ∞ ¹ 12.60µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/100_silences,_50%_expired 15.09µ ± ∞ ¹ 15.89µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/100_silences,_90%_expired 18.08µ ± ∞ ¹ 18.78µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/1000_silences,_0%_expired 66.35µ ± ∞ ¹ 81.53µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/1000_silences,_10%_expired 79.73µ ± ∞ ¹ 92.18µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/1000_silences,_50%_expired 121.3µ ± ∞ ¹ 129.0µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/1000_silences,_90%_expired 156.6µ ± ∞ ¹ 166.2µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/10000_silences,_0%_expired 649.0µ ± ∞ ¹ 1037.9µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/10000_silences,_10%_expired 823.0µ ± ∞ ¹ 1193.3µ ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/10000_silences,_50%_expired 1.383m ± ∞ ¹ 1.623m ± ∞ ¹ ~ (p=0.333 n=2) ²
GC/10000_silences,_90%_expired 1.817m ± ∞ ¹ 2.103m ± ∞ ¹ ~ (p=0.333 n=2) ²
geomean 112.8µ 132.6µ +17.49%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ gc0.txt │ gc1.txt │
│ B/op │ B/op vs base │
GC/100_silences,_0%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_10%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_50%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_90%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_0%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_10%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_50%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_90%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_0%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_10%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_50%_expired 1.109Ki ± ∞ ¹ 1.109Ki ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_90%_expired 1.109Ki ± ∞ ¹ 121.109Ki ± ∞ ¹ ~ (p=0.333 n=2) ³
geomean 1.109Ki 1.640Ki +47.86%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ need >= 4 samples to detect a difference at alpha level 0.05
│ gc0.txt │ gc1.txt │
│ allocs/op │ allocs/op vs base │
GC/100_silences,_0%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_10%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_50%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/100_silences,_90%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_0%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_10%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_50%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/1000_silences,_90%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_0%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_10%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_50%_expired 18.00 ± ∞ ¹ 18.00 ± ∞ ¹ ~ (p=1.000 n=2) ²
GC/10000_silences,_90%_expired 18.00 ± ∞ ¹ 19.00 ± ∞ ¹ ~ (p=0.333 n=2) ³
geomean 18.00 18.08 +0.45%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ need >= 4 samples to detect a difference at alpha level 0.05
This needs a rebase.