Silence count metric collection

Open rajagopalanand opened this issue 8 months ago • 2 comments

Currently, silence metrics are collected at scrape time. When Alertmanager is under heavy load, lock contention can occur and cause high scrape latency. One such scenario is when there are many aggregation groups and new silences are being added.

Would it be acceptable to collect silence counts in the background instead of at scrape time? Doing so would reduce scrape latency by moving the lock contention out of the scrape path. Lock contention could still occur in the background goroutine.
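To make the proposal concrete, here is a minimal sketch of the idea, not the code from the PR. It assumes a hypothetical `store` type standing in for the silence store and a hypothetical `cachedCounts` type that a scrape-time metric would read from. A background goroutine refreshes the counts on a ticker, so a scrape only loads atomics and never waits on the store's lock.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// store is a hypothetical stand-in for the silence store; its mutex
// plays the role of the lock that the scrape path contends on.
type store struct {
	mu     sync.RWMutex
	states []string // one entry per silence: "active", "pending" or "expired"
}

// countState takes the lock and counts the silences in the given state.
func (s *store) countState(state string) int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	var n int64
	for _, st := range s.states {
		if st == state {
			n++
		}
	}
	return n
}

// cachedCounts holds the most recently collected counts. A scrape would
// read these atomics and never touch the store's lock.
type cachedCounts struct {
	active, pending, expired atomic.Int64
}

// refresh recomputes all counts. Any lock contention happens here,
// in the background goroutine, rather than during a scrape.
func (c *cachedCounts) refresh(s *store) {
	c.active.Store(s.countState("active"))
	c.pending.Store(s.countState("pending"))
	c.expired.Store(s.countState("expired"))
}

// run refreshes once immediately, then on every tick until stop is closed.
func (c *cachedCounts) run(s *store, interval time.Duration, stop <-chan struct{}) {
	c.refresh(s)
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			c.refresh(s)
		case <-stop:
			return
		}
	}
}

func main() {
	s := &store{states: []string{"active", "active", "pending", "expired"}}
	c := &cachedCounts{}
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		c.run(s, 10*time.Millisecond, stop)
	}()

	time.Sleep(50 * time.Millisecond) // let the collector run a few times
	close(stop)
	<-done

	// What a scrape would report: cached values, no lock taken.
	fmt.Println(c.active.Load(), c.pending.Load(), c.expired.Load())
}
```

The trade-off is that a scrape may report counts that are up to one refresh interval old, or older if the background goroutine is itself blocked on the lock.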

Profile captured during high latency in scraping

-----------+-------------------------------------------------------
             runtime.gopark build/lib/src/runtime/proc.go:424
             runtime.goparkunlock build/lib/src/runtime/proc.go:430 (inline)
             runtime.semacquire1 build/lib/src/runtime/sema.go:178
             sync.runtime_SemacquireMutex build/lib/src/runtime/sema.go:95
             sync.(*Mutex).lockSlow build/lib/src/sync/mutex.go:173
             sync.(*Mutex).Lock build/lib/src/sync/mutex.go:92 (inline)
             sync.(*RWMutex).Lock build/lib/src/sync/rwmutex.go:148
             github.com/prometheus/alertmanager/silence.(*Silences).Query /build/gopath/src/github.com/prometheus/alertmanager/silence/silence.go:797
             github.com/prometheus/alertmanager/silence.(*Silencer).Mutes /build/gopath/src/github.com/prometheus/alertmanager/silence/silence.go:145
             github.com/prometheus/alertmanager/notify.(*MuteStage).Exec /build/gopath/src/github.com/prometheus/alertmanager/notify/notify.go:599
             github.com/prometheus/alertmanager/notify.MultiStage.Exec /build/gopath/src/github.com/prometheus/alertmanager/notify/notify.go:512
             github.com/prometheus/alertmanager/notify.RoutingStage.Exec /build/gopath/src/github.com/prometheus/alertmanager/notify/notify.go:495
             github.com/prometheus/alertmanager/dispatch.(*Dispatcher).processAlert.func1 /build/gopath/src/github.com/prometheus/alertmanager/dispatch/dispatch.go:423
             github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run.func1 /build/gopath/src/github.com/prometheus/alertmanager/dispatch/dispatch.go:548
             github.com/prometheus/alertmanager/dispatch.(*aggrGroup).flush /build/gopath/src/github.com/prometheus/alertmanager/dispatch/dispatch.go:611
             github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run /build/gopath/src/github.com/prometheus/alertmanager/dispatch/dispatch.go:547
-----------+-------------------------------------------------------
             runtime.gopark build/lib/src/runtime/proc.go:424
             runtime.goparkunlock build/lib/src/runtime/proc.go:430 (inline)
             runtime.semacquire1 build/lib/src/runtime/sema.go:178
             sync.runtime_SemacquireMutex build/lib/src/runtime/sema.go:95
             sync.(*Mutex).lockSlow build/lib/src/sync/mutex.go:173
             sync.(*Mutex).Lock build/lib/src/sync/mutex.go:92 (inline)
             sync.(*RWMutex).Lock build/lib/src/sync/rwmutex.go:148
             github.com/prometheus/alertmanager/silence.(*Silences).Query /build/gopath/src/github.com/prometheus/alertmanager/silence/silence.go:797
             github.com/prometheus/alertmanager/silence.(*Silences).CountState /build/gopath/src/github.com/prometheus/alertmanager/silence/silence.go:827
             github.com/prometheus/alertmanager/silence.newSilenceMetricByState.func1 /build/gopath/src/github.com/prometheus/alertmanager/silence/silence.go:242
             github.com/prometheus/client_golang/prometheus.(*valueFunc).Write /build/gopath/src/github.com/prometheus/client_golang/prometheus/value.go:95
             github.com/prometheus/client_golang/prometheus.processMetric /build/gopath/src/github.com/prometheus/client_golang/prometheus/registry.go:633
             github.com/prometheus/client_golang/prometheus.(*Registry).Gather /build/gopath/src/github.com/prometheus/client_golang/prometheus/registry.go:502
-----------+-------------------------------------------------------   

PR to collect silence counts in a separate goroutine

rajagopalanand avatar Apr 19 '25 15:04 rajagopalanand

I'm more interested to see if this can be made faster instead of offloading it to a goroutine. There is a comment in CountState:

// This could probably be optimized.

Perhaps first look at how to make it faster? The lock still has to be acquired, so I would assume that under very heavy load you would just be scraping stale silence metrics.
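One possible direction, as a sketch only: the profile above shows `newSilenceMetricByState` calling `CountState`, which calls `Query`. If that means each state's count is computed by its own query (an assumption here), all states could instead be tallied in a single pass under one lock acquisition. The types and function names below are hypothetical and are not Alertmanager's.

```go
package main

import "fmt"

// silenceState is a hypothetical stand-in for the silence state type.
type silenceState string

const (
	stateActive  silenceState = "active"
	statePending silenceState = "pending"
	stateExpired silenceState = "expired"
)

// countAllStates tallies every state in one pass over the silences.
// The caller is assumed to hold whatever lock protects the slice, so
// the lock is taken once per collection instead of once per state.
func countAllStates(states []silenceState) map[silenceState]int {
	counts := map[silenceState]int{
		stateActive:  0,
		statePending: 0,
		stateExpired: 0,
	}
	for _, st := range states {
		counts[st]++
	}
	return counts
}

func main() {
	counts := countAllStates([]silenceState{stateActive, statePending, stateActive})
	fmt.Println(counts[stateActive], counts[statePending], counts[stateExpired])
	// prints "2 1 0"
}
```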

grobinson-grafana avatar Apr 20 '25 11:04 grobinson-grafana

I can investigate whether any improvements could be made. I just want to note that counting silences holds up the collection of other metrics too.

rajagopalanand avatar Apr 20 '25 19:04 rajagopalanand