Fix race conditions in the memory alerts store

Open damnever opened this issue 2 years ago • 3 comments

The main branch easily fails the newly added test case:

=== RUN   TestAlertsConcurrently
    mem_test.go:565:
                Error Trace:    /prometheus-io/alertmanager/provider/mem/mem_test.go:565
                Error:          Not equal:
                                expected: 0
                                actual  : -171
                Test:           TestAlertsConcurrently

There are multiple race conditions in the provider/mem:

  1. Any Put or GC operation that runs concurrently with this code block introduces a race condition (a simplified sketch of this pattern follows below): https://github.com/prometheus/alertmanager/blob/c920b605b647f2548e23a3dc2fea014b9c9dbf43/provider/mem/mem.go#L204-L228
  2. A race condition between Put() and Subscribe() can cause some newly added Alerts to be missed.
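Neither of these needs the real provider/mem types to reproduce in spirit. Below is a minimal, self-contained sketch (simplified names, not the actual Alertmanager code) of the first pattern: a read-modify-write that locks around each individual step, so a concurrent delete playing the role of gc can slip in between the read and the write:

package main

import "sync"

// store is a simplified stand-in for the alert store; names and types
// are illustrative only, not the real provider/mem code.
type store struct {
	mu     sync.Mutex
	alerts map[string]int
}

func (s *store) get(k string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.alerts[k]
	return v, ok
}

func (s *store) set(k string, v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.alerts[k] = v
}

func (s *store) delete(k string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.alerts, k)
}

// put mimics the racy pattern: get and set are individually locked, but
// nothing prevents a delete (gc) from running between them, so an entry
// can be deleted and then silently re-added, or a merge can be lost.
func (s *store) put(k string, v int) {
	if old, ok := s.get(k); ok {
		v += old // merge with the existing entry
	}
	// <-- a concurrent gc can delete the key right here
	s.set(k, v)
}

func main() {
	s := &store{alerts: map[string]int{}}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			s.put("alert", 1)
		}()
		go func() {
			defer wg.Done()
			s.delete("alert") // plays the role of gc
		}()
	}
	wg.Wait()
}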

@simonpasquier @w0rm @gotjosh please take a look

damnever avatar Dec 28 '23 06:12 damnever

@simonpasquier @w0rm @gotjosh please take a look

damnever avatar Mar 05 '24 06:03 damnever

@simonpasquier @gotjosh is this on your radar?

beorn7 avatar Apr 30 '24 08:04 beorn7

@grobinson-grafana would you also mind taking a look at this?

damnever avatar May 06 '24 02:05 damnever

A race condition between Put() and Subscribe() can cause some newly added Alerts to be missed.

Apologies for asking lots of questions, but I would like to understand this case. I can see there is a race condition between the Get and Set operations in Put because 1.) two or more goroutines can call Put at the same time and 2.) the calls to Get and Set can be interleaved with a gc operation on the store, causing the alert to be deleted and then added back. I agree this needs to be fixed.

What I would still like to understand is how the alert would be missed. The mutex is acquired both in Subscribe and before the alert is sent to the listeners, so it should not be possible for a new listener to miss an alert? Would it be possible to show how this happens with a test?
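For reference, the usual way to close that window is to hold one lock across the entire read-merge-write; here is a minimal sketch, reusing the simplified store type from the sketch earlier in this thread (illustrative only, not the actual change in the fix):

// putAtomic holds the mutex for the whole read-merge-write, so a
// concurrent delete (gc) can no longer interleave between the read
// and the write.
func (s *store) putAtomic(k string, v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if old, ok := s.alerts[k]; ok {
		v += old
	}
	s.alerts[k] = v
}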

grobinson-grafana avatar May 13 '24 16:05 grobinson-grafana

I also opened a draft PR https://github.com/prometheus/alertmanager/pull/3840 that builds on this fix. It removes gc and the callback from store.go.

grobinson-grafana avatar May 13 '24 21:05 grobinson-grafana

A race condition between Put() and Subscribe() can cause some newly added Alerts to be missed.

Apologies for asking lots of questions, but I would like to understand this case. I can see there is a race condition between the Get and Set operations in Put because 1.) two or more goroutines can call Put at the same time and 2.) the calls to Get and Set can be interleaved with a gc operation on the store, causing the alert to be deleted and then added back. I agree this needs to be fixed.

What I would still like to understand is how the alert would be missed. The mutex is acquired both in Subscribe and before the alert is sent to the listeners, so it should not be possible for a new listener to miss an alert? Would it be possible to show how this happens with a test?

This all comes together, not as an independent case. The reason is that we can delete alerts solely in store.Alerts without holding the lock from mem.Alerts.mtx. Imagine if the gc happens right after the Put (before the fix); then the listener might miss some alerts due to the race condition.
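To make the layout I mean concrete, here is a minimal sketch with simplified names (not the real code): the store guards its map with its own mutex, the listeners are guarded by the outer mutex, and nothing ties the two together, so gc can run between the store update and the notification:

package memsketch

import "sync"

// simpleStore stands in for store.Alerts: it is guarded by its own
// mutex, independent of the outer Alerts mutex.
type simpleStore struct {
	mu     sync.Mutex
	alerts map[string]string
}

func (s *simpleStore) set(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.alerts[k] = v
}

// gc expires entries while holding only the store's own lock.
func (s *simpleStore) gc() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k := range s.alerts {
		delete(s.alerts, k)
	}
}

// alertsSketch stands in for mem.Alerts: listeners are guarded by mtx,
// but the embedded store is guarded only by its own mutex.
type alertsSketch struct {
	mtx       sync.Mutex
	store     *simpleStore
	listeners []chan string
}

func (a *alertsSketch) put(k, v string) {
	a.store.set(k, v) // holds only the store's lock
	// <-- gc can run here: the alert is already gone from the store
	// before any existing listener or new subscriber has seen it.
	a.mtx.Lock()
	defer a.mtx.Unlock()
	for _, l := range a.listeners {
		l <- v
	}
}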

damnever avatar May 14 '24 02:05 damnever

The reason is that we can delete alerts solely in store.Alerts without holding the lock from mem.Alerts.mtx.

I agree!

Imagine if the gc happens right after the Put (before the fix); then the listener might miss some alerts due to the race condition.

In this case the listener will receive two events: an event for the Put and a second event for the GC? It can receive them out of order, but I don't think events can be lost?

The reason I think that is that the alerts sent to a.listeners in Put come from the alerts argument, not from the alerts in the mem store:

func (a *Alerts) Put(alerts ...*types.Alert) error {
	for _, alert := range alerts {
		// ... (the Get and Set on the store happen here)
		a.mtx.Lock()
		for _, l := range a.listeners {
			select {
			case l.alerts <- alert:
			case <-l.done:
			}
		}
		a.mtx.Unlock()
	}
	return nil
}

That means even when there is a gc between Set and range a.listeners, the gc operation cannot stop these alerts from being sent to the listeners. The listeners might receive the events out of order, but I still don't see how an event can be missed?
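A tiny self-contained sketch of what I mean (not the real code): the value handed to the channel is the local variable, so deleting the store entry in between does not prevent the send; it only changes the ordering the listener observes:

package main

import "fmt"

func main() {
	store := map[string]string{}
	listener := make(chan string, 1)

	alert := "firing" // the local value passed in, like Put's argument
	store["alert"] = alert

	// A gc that runs now only touches the store map...
	delete(store, "alert")

	// ...but the notification still sends the local value, so the
	// listener receives the event regardless of the deletion.
	listener <- alert
	fmt.Println(<-listener) // prints "firing"
}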

grobinson-grafana avatar May 14 '24 07:05 grobinson-grafana

Okay, they cannot be stopped, but this is still a race.

damnever avatar May 14 '24 08:05 damnever

@grobinson-grafana I have rebased onto the main branch and changed some locks to use defer. Please take another look. If no major changes are required, I believe we should merge this first and make further improvements as needed.
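For context, the defer-based locking mentioned above is the standard Go pattern below (a generic sketch, not the exact diff):

package locksketch

import "sync"

type counter struct {
	mu sync.Mutex
	n  int
}

// Before: every return path has to remember to unlock.
func (c *counter) incExplicit() int {
	c.mu.Lock()
	c.n++
	v := c.n
	c.mu.Unlock()
	return v
}

// After: defer releases the lock on every path, including early
// returns and panics.
func (c *counter) incDeferred() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
	return c.n
}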

damnever avatar May 14 '24 08:05 damnever

Thank you very much for your contribution!

gotjosh avatar May 16 '24 10:05 gotjosh