fix(dispatch): remove waiting aggregation group goroutines
This change significantly reduces the number of sleeping goroutines: previously, each aggregation group created a goroutine that sat waiting for a timer tick.
Instead, use `time.AfterFunc` to schedule the next call to flush.
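Roughly, the pattern looks like this (a minimal self-contained sketch with hypothetical names, not the actual dispatch code): the `time.AfterFunc` callback does the flush and re-arms the timer itself, so no goroutine exists between ticks.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// group stands in for an aggregation group. Instead of a resident
// goroutine parked on a ticker, a runtime timer fires the flush.
type group struct {
	mu       sync.Mutex
	timer    *time.Timer
	stopped  bool
	interval time.Duration
	flush    func()
}

func newGroup(interval time.Duration, flush func()) *group {
	g := &group{interval: interval, flush: flush}
	g.mu.Lock()
	// The callback runs on a runtime-managed goroutine only when the
	// timer fires; nothing sleeps in user code in between.
	g.timer = time.AfterFunc(interval, g.tick)
	g.mu.Unlock()
	return g
}

func (g *group) tick() {
	g.mu.Lock()
	if g.stopped {
		g.mu.Unlock()
		return
	}
	g.mu.Unlock()

	g.flush()

	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.stopped {
		// The timer has already fired, so Reset safely re-arms it,
		// scheduling the next flush.
		g.timer.Reset(g.interval)
	}
}

func (g *group) stop() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.stopped = true
	g.timer.Stop()
}

func main() {
	g := newGroup(200*time.Millisecond, func() { fmt.Println("flush") })
	time.Sleep(time.Second)
	g.stop()
}
```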
Closes #4503
Do you have any profiles captured that show the before/after effects of this change?
Yes, it would be nice to post a pprof profile and/or metrics to show the results of this change.
Here are some metrics. In both cases I ran the same Prometheus and Alertmanager config, which results in 1500 unique alerts and aggregation groups:
From main:
# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 1532
From this branch:
# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 32
Looking at /debug/pprof/goroutine?debug=1:
From main:
goroutine profile: total 1529
1500 @ 0x100e0e160 0x100dec7cc 0x1016c2480 0x100e16a04
# 0x1016c247f github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run+0x3ff alertmanager/dispatch/dispatch.go:446
...
From this branch, there is no dispatch.(*aggrGroup).run entry in the profile at all.
(Note that during a flush we still see a lot of goroutines, but those come from notify, which we will fix in #4633.)
It's less about how many goroutines there are, and more about how much this impacts CPU and memory churn. For example, `rate(go_memstats_alloc_bytes_total[5m])` can show how much memory is being allocated: fewer allocations mean less GC, and less GC means less CPU use.
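For concreteness, a few queries of this shape could be compared between main and this branch (these are the standard client_golang / process collector metric names; a suggestion, not from the PR itself):

```
# Allocation throughput (bytes/s); lower means less GC pressure.
rate(go_memstats_alloc_bytes_total[5m])

# Time spent in garbage collection.
rate(go_gc_duration_seconds_sum[5m])

# Overall CPU usage of the process.
rate(process_cpu_seconds_total[5m])
```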
I think this is a safe change to run in our production canary, which usually has ~8k aggregation groups, so I'll backport it to v0.27 and then compare metrics.