Fix race conditions in the memory alerts store
The main branch will easily fail the newly added test case:
=== RUN TestAlertsConcurrently
mem_test.go:565:
Error Trace: /prometheus-io/alertmanager/provider/mem/mem_test.go:565
Error: Not equal:
expected: 0
actual : -171
Test: TestAlertsConcurrently
There are multiple race conditions in the provider/mem:
- Any Put or GC operation that occurs concurrently with this code block will introduce a race condition https://github.com/prometheus/alertmanager/blob/c920b605b647f2548e23a3dc2fea014b9c9dbf43/provider/mem/mem.go#L204-L228
- A race condition between Put() and Subscribe() can cause some newly added Alerts to be missed.
@simonpasquier @w0rm @gotjosh please take a look
@simonpasquier @w0rm @gotjosh please take a look
@simonpasquier @gotjosh is this on your radar?
@grobinson-grafana would you also mind taking a look at this?
A race condition between Put() and Subscribe() can cause some newly added Alerts to be missed.
Apologies for asking lots of questions, but I would like to understand this case. I can see there is a race condition between the Get and Set operations in Put because 1.) two or more goroutines can call Put at the same time and 2.) the calls to Get and Set can be interleaved with a gc operation on the store, causing the alert to be deleted and the added back. I agree this needs to be fixed.
What I would still like to understand is how would the alert be missed? The mutex is acquired in Subscribe and before the alert is sent to the listeners, so it should not happen that a new listener misses an alert? Would it be possible to show how this happens with a test?
I also opened a draft PR https://github.com/prometheus/alertmanager/pull/3840 that builds on this fix. It removes gc and the callback from store.go.
A race condition between Put() and Subscribe() can cause some newly added Alerts to be missed.
Apologies for asking lots of questions, but I would like to understand this case. I can see there is a race condition between the
GetandSetoperations inPutbecause 1.) two or more goroutines can callPutat the same time and 2.) the calls toGetandSetcan be interleaved with a gc operation on the store, causing the alert to be deleted and the added back. I agree this needs to be fixed.What I would still like to understand is how would the alert be missed? The mutex is acquired in
Subscribeand before the alert is sent to the listeners, so it should not happen that a new listener misses an alert? Would it be possible to show how this happens with a test?
This all comes together, not as an independent case. The reason is that we can delete alerts solely in store.Alerts without the lock from mem.Alerts.mtx. Imagine if the gc happens right after the Put(before the fix), then the listener might miss some alerts due to the race condition.
A race condition between Put() and Subscribe() can cause some newly added Alerts to be missed.
Apologies for asking lots of questions, but I would like to understand this case. I can see there is a race condition between the
GetandSetoperations inPutbecause 1.) two or more goroutines can callPutat the same time and 2.) the calls toGetandSetcan be interleaved with a gc operation on the store, causing the alert to be deleted and the added back. I agree this needs to be fixed. What I would still like to understand is how would the alert be missed? The mutex is acquired inSubscribeand before the alert is sent to the listeners, so it should not happen that a new listener misses an alert? Would it be possible to show how this happens with a test?This all comes together, not as an independent case. The reason is that we can delete alerts solely in store.Alerts without the lock from mem.Alerts.mtx. Imagine if the gc happens right after the Put(before the fix), then the listener might miss some alerts due to the race condition.
The reason is that we can delete alerts solely in store.Alerts without the lock from mem.Alerts.mtx.
I agree!
Imagine if the gc happens right after the Put(before the fix), then the listener might miss some alerts due to the race condition.
In this case the listener will receive two events: an event for the Put and a second event for the GC? It can receive them out of order, but I don't think events can be lost?
The reason I think that is the alerts that are sent to a.listeners in Put comes from the argument alerts and not from the alerts in the mem store:
func (a *Alerts) Put(alerts ...*types.Alert) error {
for _, alert := range alerts {
...
a.mtx.Lock()
for _, l := range a.listeners {
select {
case l.alerts <- alert:
case <-l.done:
}
}
a.mtx.Unlock()
}
That means even when there is a gc between Set and range a.listeners, the gc operation cannot stop these alerts from being sent to the listeners. The listeners might receive the events out of order, but I still don't see how an event can be missed?
okay, they cannot be stopped, but this is a race.
@grobinson-grafana I have rebased with the main branch and changed some locks to use defer. Please take another look. If there are no major change requirements, I believe we should merge this first and make further improvements as needed.
Thank you very much for your contribution!