fix(dispatch): remove waiting aggregation group goroutines
This change significantly reduces the number of sleeping goroutines: previously, each aggregation group created a goroutine that sat waiting for a timer tick.
Instead, use `time.AfterFunc` to schedule the next call to flush.
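Roughly, the pattern looks like this (a minimal self-contained sketch with hypothetical names, not the actual dispatch code): the `time.AfterFunc` callback does the flush and re-arms the timer itself, so no goroutine exists between ticks.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// group stands in for an aggregation group. Instead of a resident
// goroutine parked on a ticker, a runtime timer fires the flush.
type group struct {
	mu       sync.Mutex
	timer    *time.Timer
	stopped  bool
	interval time.Duration
	flush    func()
}

func newGroup(interval time.Duration, flush func()) *group {
	g := &group{interval: interval, flush: flush}
	g.mu.Lock()
	// The callback runs on a runtime-managed goroutine only when the
	// timer fires; nothing sleeps in user code in between.
	g.timer = time.AfterFunc(interval, g.tick)
	g.mu.Unlock()
	return g
}

func (g *group) tick() {
	g.mu.Lock()
	if g.stopped {
		g.mu.Unlock()
		return
	}
	g.mu.Unlock()

	g.flush()

	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.stopped {
		// The timer has already fired, so Reset safely re-arms it,
		// scheduling the next flush.
		g.timer.Reset(g.interval)
	}
}

func (g *group) stop() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.stopped = true
	g.timer.Stop()
}

func main() {
	g := newGroup(200*time.Millisecond, func() { fmt.Println("flush") })
	time.Sleep(time.Second)
	g.stop()
}
```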
Closes #4503
Do you have any profiles captured that show the before/after effects of this change?
Yes, it would be nice to post a pprof profile and/or metrics to show the results of this change.
Here are some metrics. In both cases I ran the same Prometheus and Alertmanager config, which results in 1500 unique alerts and aggregation groups:
From main:
# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 1532
From this branch:
# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 32
Looking at /debug/pprof/goroutine?debug=1:
From main:
goroutine profile: total 1529
1500 @ 0x100e0e160 0x100dec7cc 0x1016c2480 0x100e16a04
# 0x1016c247f github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run+0x3ff alertmanager/dispatch/dispatch.go:446
...
From this branch, there is no dispatch.(*aggrGroup).run entry in the profile at all.
(Note that during a flush we still see a lot of goroutines, but those come from notify, which we will fix in #4633.)
It's less about how many goroutines there are, and more about how much this impacts CPU and memory churn. For example, `rate(go_memstats_alloc_bytes_total[5m])` can show how much memory is being allocated: fewer allocations mean less GC, and less GC means less CPU use.
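For concreteness, a few queries of this shape could be compared between main and this branch (these are the standard client_golang / process collector metric names; a suggestion, not from the PR itself):

```
# Allocation throughput (bytes/s); lower means less GC pressure.
rate(go_memstats_alloc_bytes_total[5m])

# Time spent in garbage collection.
rate(go_gc_duration_seconds_sum[5m])

# Overall CPU usage of the process.
rate(process_cpu_seconds_total[5m])
```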
I think this is a safe change to run in our production canary, which usually has ~8k aggregation groups, so I'll backport it to v0.27 and then compare metrics.