alertmanager

reduce the time Dispatcher.Groups holds the mutex

Open Spaceman1701 opened this issue 2 months ago • 0 comments

Groups calls can take a long time when there are many aggrGroups or when one of the filter functions is slow. Right now, Groups holds the Dispatcher lock for its entire duration. This is aggravated by the fact that the API passes filter functions that call Silences.Mutes and Inhibits.Mutes, which themselves hold locks.

Since the Dispatcher needs to hold a write lock on Dispatcher.mtx in order to ingest alerts, Groups calls essentially block alert ingestion. Since Groups depends on Silences.Mutes, this also means that calls to Silences.Mutes can block ingestion. Since that also blocks all the various Silences API endpoints and some of the gossip channels, this becomes a big knot of locks that causes the alertmanager to hang if something is hammering GET /alerts/groups. Unfortunately, many dashboard services do just that.

This patch just copies the aggrGroupsPerRoute map out of the dispatcher and then releases the lock for the rest of the Groups call. This ensures that we never need to hold the dispatcher lock and the silencer or inhibitor locks at the same time.
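To illustrate the idea, here's a minimal sketch of the copy-then-release pattern. It is not the actual Alertmanager code: the `dispatcher`, `group`, `snapshot`, and `add` names are hypothetical stand-ins, and it assumes the lock is a `sync.RWMutex` guarding a nested map keyed by route and then by group fingerprint.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// group is a hypothetical stand-in for an aggregation group.
type group struct{ name string }

// dispatcher is a simplified model for illustration only,
// not Alertmanager's real Dispatcher type.
type dispatcher struct {
	mtx    sync.RWMutex
	groups map[string]map[string]*group // route -> fingerprint -> group
}

func newDispatcher() *dispatcher {
	return &dispatcher{groups: map[string]map[string]*group{}}
}

// add models ingestion: it needs the write lock.
func (d *dispatcher) add(route, fp string, g *group) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	if d.groups[route] == nil {
		d.groups[route] = map[string]*group{}
	}
	d.groups[route][fp] = g
}

// snapshot copies both map levels under the read lock and returns the copy.
// The group pointers themselves are shared, not cloned.
func (d *dispatcher) snapshot() map[string]map[string]*group {
	d.mtx.RLock()
	defer d.mtx.RUnlock()
	out := make(map[string]map[string]*group, len(d.groups))
	for route, byFP := range d.groups {
		inner := make(map[string]*group, len(byFP))
		for fp, g := range byFP {
			inner[fp] = g
		}
		out[route] = inner
	}
	return out
}

// Groups runs the (possibly slow) filter against the snapshot,
// so d.mtx is not held while keep executes.
func (d *dispatcher) Groups(keep func(*group) bool) []string {
	snap := d.snapshot()
	var names []string
	for _, byFP := range snap {
		for _, g := range byFP {
			if keep(g) {
				names = append(names, g.name)
			}
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	d := newDispatcher()
	d.add("route-a", "fp1", &group{name: "g1"})
	d.add("route-a", "fp2", &group{name: "g2"})

	// The filter takes the write lock itself. This would deadlock if
	// Groups still held d.mtx while calling it.
	names := d.Groups(func(g *group) bool {
		d.add("route-b", "fp3", &group{name: "g3"})
		return true
	})
	fmt.Println(names)
}
```

The trade-off in this sketch is that the filter sees a point-in-time view: groups added after the snapshot are not included in that call. Because the group pointers are shared, each group still has to be safe for concurrent access on its own.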

We've been running this patch in production for quite a while now. We've found that performance is substantially improved, especially around startup time (when the silencer/inhibitor are both extra slow). We've also measured much less mutex contention after adding this.

I don't have any synthetic benchmarks for this one unfortunately.

Here are some profiling comparisons of an alertmanager restart before and after the patch:

[Screenshot 2024-12-23 at 4 09 06 PM]
[Screenshot 2024-12-23 at 4 02 40 PM]

Spaceman1701, Oct 30 '25 17:10