alertmanager icon indicating copy to clipboard operation
alertmanager copied to clipboard

Add new behavior to avoid races on config reload

Open Spaceman1701 opened this issue 1 month ago • 1 comments

When the config is reloaded, alertmanager re-creates the inhibit.Inhibitor and the dispatch.Dispatcher. This is necessary because both the Inhibitor and Dispatcher have internal state which depends on the config.

In theory, the re-build should be fast because the provider.Alerts still contains all the active alerts in memory. In practice, it causes some problems. Since the Inhibitor and Dispatcher both ingest alerts one at a time on separate goroutines, their state is built concurrently.

The Inhibitor works by building an internal cache of inhibiting alerts. If an active alert is missing from the Inhibitor's cache, it does not cause inhibitions. This means that an inhibitor with a partially built cache can erroneously return false from Inhibitor.Mutes.

The Dispatcher is responsible for building aggrGroups which in turn flush alerts into the notification pipeline. The notification pipeline calls Inhibitor.Mutes to prevent notifying for inhibited alerts.

This all comes together to cause a problem: If the Dispatcher is able to build an aggrGroup that contains inhibited alerts before the Inhibitor is able to process those alerts, it may cause a incorrect notification to fire. In the worst case, if the Dispatcher is able to build any aggrGroup before the Inhibitor has completed building its cache, we could see incorrect notifications. Essentially, config reloads cause a race condition between the Dispatcher and Inhibitor internal caches.

#4704 largely solves this problem for alertmanager restarts by causing the Dispatcher to delay sending alerts after startup. However, this isn't desirable during config reloads for a number of reasons. Most importantly, config reloads need to be applied to the entire alertmanager cluster all at once. Any artificial delay will delay any notifications from the cluster. In practice, we've seen this as a spike of notifications for inhibited alerts right after a config reload.

Another related problem is the API. If an API function calls Dispatcher.Groups right after a config reload, it might see fewer groups than the in-memory state of alerts would actually create. This is because the disp pointer that the API uses is swapped as soon as the new DIspatcher is constructed. In practice, we've seen this as the /alerts/groups endpoint returning nothing right after a config reload.

This PR adds new mechanisms to avoid all these race conditions. Since the provider.Alerts isn't reconstructed, we just need the Dispatcher to wait for the Inhibitor to process all the alerts which are already in the provider.Alerts. Unfortunately, there's no interface to do that. This PR adds

  1. A new provider.Alerts.SlurpAndSubscribe method which allows implementations of provider.Alerts to return a batch of alerts to the caller immediately, rather than one at a time through a provider.AlertIterator. The implementation in the mem.Alerts is very simple - just return everything that's currently in memory as a batch and then construct the iterator as normal.
  2. Add mechanisms to both the Inhibitor and Dispatcher to indicate whether they're done loading. This just uses a sync.WaitGroup internal to each struct. Both are updated to call SlurpAndSubscribe and indicate loading is finished when they've finished processing the initial batch.
  3. Re-order the logic for reconstructing the Dispatcher and Inhibitor during a config reload.
    1. Reload the Inhibitor first, and then reload the Dispatcher when that's finished
    2. Don't swap the disp pointer to the newly reconstructed Dispatcher until it's done loading. This prevents the API from seeing the wrong number of groups when it calls Dispatcher.Groups.

Spaceman1701 avatar Nov 06 '25 17:11 Spaceman1701

Needs a rebase

Done, thanks for the heads up!

Spaceman1701 avatar Nov 17 '25 15:11 Spaceman1701