Silence count metric collection
Silence metric collection currently happens at scrape time. When AlertManager is under heavy load, lock contention can occur, causing high scrape latency. One such scenario is when there are many aggregation groups and new silences are being added.
Would it be acceptable to collect the silence count in the background instead of at scrape time? Doing so removes the lock contention from the scrape path and reduces scrape latency. Lock contention can still occur in the background goroutine.
Profile captured during high scrape latency:
-----------+-------------------------------------------------------
runtime.gopark build/lib/src/runtime/proc.go:424
runtime.goparkunlock build/lib/src/runtime/proc.go:430 (inline)
runtime.semacquire1 build/lib/src/runtime/sema.go:178
sync.runtime_SemacquireMutex build/lib/src/runtime/sema.go:95
sync.(*Mutex).lockSlow build/lib/src/sync/mutex.go:173
sync.(*Mutex).Lock build/lib/src/sync/mutex.go:92 (inline)
sync.(*RWMutex).Lock build/lib/src/sync/rwmutex.go:148
github.com/prometheus/alertmanager/silence.(*Silences).Query /build/gopath/src/github.com/prometheus/alertmanager/silence/silence.go:797
github.com/prometheus/alertmanager/silence.(*Silencer).Mutes /build/gopath/src/github.com/prometheus/alertmanager/silence/silence.go:145
github.com/prometheus/alertmanager/notify.(*MuteStage).Exec /build/gopath/src/github.com/prometheus/alertmanager/notify/notify.go:599
github.com/prometheus/alertmanager/notify.MultiStage.Exec /build/gopath/src/github.com/prometheus/alertmanager/notify/notify.go:512
github.com/prometheus/alertmanager/notify.RoutingStage.Exec /build/gopath/src/github.com/prometheus/alertmanager/notify/notify.go:495
github.com/prometheus/alertmanager/dispatch.(*Dispatcher).processAlert.func1 /build/gopath/src/github.com/prometheus/alertmanager/dispatch/dispatch.go:423
github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run.func1 /build/gopath/src/github.com/prometheus/alertmanager/dispatch/dispatch.go:548
github.com/prometheus/alertmanager/dispatch.(*aggrGroup).flush /build/gopath/src/github.com/prometheus/alertmanager/dispatch/dispatch.go:611
github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run /build/gopath/src/github.com/prometheus/alertmanager/dispatch/dispatch.go:547
-----------+-------------------------------------------------------
runtime.gopark build/lib/src/runtime/proc.go:424
runtime.goparkunlock build/lib/src/runtime/proc.go:430 (inline)
runtime.semacquire1 build/lib/src/runtime/sema.go:178
sync.runtime_SemacquireMutex build/lib/src/runtime/sema.go:95
sync.(*Mutex).lockSlow build/lib/src/sync/mutex.go:173
sync.(*Mutex).Lock build/lib/src/sync/mutex.go:92 (inline)
sync.(*RWMutex).Lock build/lib/src/sync/rwmutex.go:148
github.com/prometheus/alertmanager/silence.(*Silences).Query /build/gopath/src/github.com/prometheus/alertmanager/silence/silence.go:797
github.com/prometheus/alertmanager/silence.(*Silences).CountState /build/gopath/src/github.com/prometheus/alertmanager/silence/silence.go:827
github.com/prometheus/alertmanager/silence.newSilenceMetricByState.func1 /build/gopath/src/github.com/prometheus/alertmanager/silence/silence.go:242
github.com/prometheus/client_golang/prometheus.(*valueFunc).Write /build/gopath/src/github.com/prometheus/client_golang/prometheus/value.go:95
github.com/prometheus/client_golang/prometheus.processMetric /build/gopath/src/github.com/prometheus/client_golang/prometheus/registry.go:633
github.com/prometheus/client_golang/prometheus.(*Registry).Gather /build/gopath/src/github.com/prometheus/client_golang/prometheus/registry.go:502
-----------+-------------------------------------------------------
PR to collect silence counts in a separate goroutine
I'm more interested to see if this can be made faster instead of offloading it to a goroutine. There is a comment in CountState:
// This could probably be optimized.
Perhaps first look at how to make it faster? The lock still has to be acquired, so I would assume that under very heavy load you're just scraping stale silence metrics.
I can investigate whether there are any improvements to be made. Just want to note that counting silences holds up the collection of other metrics too.