
bugfix: only increment Silences version if a silence is added

Open • Spaceman1701 opened this issue 1 year ago • 8 comments

This is a small change that adds a second return value to state.merge to differentiate between cases where a silence is updated in place and where a silence is added. Callers use it to decide whether Silences.version ought to be incremented. It fixes a bug where Silences.Version() returns an incremented value for operations that don't need to increment it. In our environment this change results in a significant performance improvement for production workloads because many fewer QMatches queries are issued by Silencer.Mutes.
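
For reviewers skimming the diff, the shape of the change is roughly the following sketch (approximate names and signatures only, written against the silence package; it is not the exact diff):

```go
// Sketch only: merge now reports both whether the entry was merged and
// whether its ID was new to the state.
func (s state) merge(msil *pb.MeshSilence, now time.Time) (merged, added bool) {
	if msil.ExpiresAt.Before(now) {
		// Expired beyond retention: nothing to merge.
		return false, false
	}
	id := msil.Silence.Id
	prev, ok := s[id]
	if !ok || prev.Silence.UpdatedAt.Before(msil.Silence.UpdatedAt) {
		s[id] = msil
		// "added" is true only when the ID was not present before,
		// i.e. this is not an in-place update of an existing silence.
		return true, !ok
	}
	return false, false
}

// Callers such as Silences.setSilence then bump the version only on addition:
//
//	if _, added := s.st.merge(msil, now); added {
//		s.version++
//	}
```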

I believe this is just a bug, for three reasons:

  1. The version counter in Silences is documented as "Increments whenever silences are added" (and other operations that mutate the silence state, such as the GC, do not increment the version) (https://github.com/prometheus/alertmanager/blob/main/silence/silence.go#L200)
  2. All application code assumes that the underlying data in a pb.Silence could've changed even if the Silences.Version() hasn't changed. For example, in Silencer.Mutes, we issue a QIDs query to recheck every silence even if the marker version matches the Silences version (https://github.com/prometheus/alertmanager/blob/main/silence/silence.go#L139-L142); see the sketch after this list.
  3. Logically, an individual Silence's state can change without any modification to the Silences object, so no future querier can ever assume that an unchanged version means that the results of a query will not change. Instead, they can only assume that no new silences have been added. This extends further than just the Silence's state since the Silences API does not guarantee that an expired silence will be retained between calls to Query.
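
To make point 2 concrete, here is a condensed paraphrase of the version check in Silencer.Mutes (error handling and the marker update are elided; see the linked code for the real thing):

```go
// Condensed paraphrase of Silencer.Mutes, not the verbatim upstream code.
func (s *Silencer) Mutes(lset model.LabelSet) bool {
	fp := lset.Fingerprint()
	activeIDs, pendingIDs, markerVersion, _ := s.marker.Silenced(fp)

	var (
		sils []*pb.Silence
		err  error
	)
	if markerVersion == s.silences.Version() {
		// No silences were added since the last check, but the state of the
		// previously applicable silences may still have changed (pending ->
		// active -> expired), so they are rechecked with a cheap QIDs query.
		ids := append(append([]string{}, activeIDs...), pendingIDs...)
		if len(ids) == 0 {
			return false
		}
		sils, _, err = s.silences.Query(
			QIDs(ids...),
			QState(types.SilenceStateActive, types.SilenceStatePending),
		)
	} else {
		// The version changed, i.e. silences were added: fall back to the
		// expensive QMatches query against the alert's label set.
		sils, _, err = s.silences.Query(
			QState(types.SilenceStateActive, types.SilenceStatePending),
			QMatches(lset),
		)
	}
	if err != nil {
		return false // error handling elided in this sketch
	}
	// The real code splits the result into active and pending silences,
	// updates the marker, and mutes only on active ones; condensed here.
	return len(sils) > 0
}
```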

All existing tests pass without modification after this change. I believe those tests exercise some of the querying logic, which further indicates that the guarantees of the Silences API are unchanged.

Additionally, I've added conditions to TestSilenceSet which check the result of Silences.Version() before and after each operation. I could not find any existing tests which validate the behavior of Silences.Version(), but this test seems like the most sensible place for them. I've limited the guarantee asserted against the API to just "the version changes if and only if a silence with a new ID is added"; the test does not enforce that the version increases, or that it changes by only one. The modified test fails without this change, but passes with it:

go test ./silence/
--- FAIL: TestSilenceSet (0.00s)
    silence_test.go:413:
        	Error Trace:	/home/ehunter/nfs-de/oss/forks/alertmanager/silence/silence_test.go:413
        	Error:      	Not equal:
        	            	expected: 3
        	            	actual  : 2
        	Test:       	TestSilenceSet
FAIL
FAIL	github.com/prometheus/alertmanager/silence	0.030s
FAIL
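
For reference, the added checks look roughly like this (illustrative only: it assumes the Set signature on current main that returns just an error, the testify require helpers already used in silence_test.go, and a silence sil1 created earlier in the test):

```go
// Illustrative shape of the added assertions, not the exact test code.
v := s.Version()

// Setting a silence with a new ID must change the version.
require.NoError(t, s.Set(sil1))
require.NotEqual(t, v, s.Version())
v = s.Version()

// Updating the same silence in place (e.g. extending EndsAt) must leave
// the version unchanged.
sil1.EndsAt = sil1.EndsAt.Add(time.Hour)
require.NoError(t, s.Set(sil1))
require.Equal(t, v, s.Version())
```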

We've also been running this patch in our environment for a while without any issues.

Spaceman1701 • Aug 16 '24 16:08

In our environment this change results in a significant performance improvement for production workloads because many fewer QMatches queries are issued by Silencer.Mutes.

Can you share some numbers so we can get a sense of how big an improvement this is?

grobinson-grafana • Aug 20 '24 11:08

Can you share some numbers so we can get a sense of how big an improvement this is?

Yes, I've attached some graphs which show the performance improvements we've measured from this change. Unfortunately, I'm not able to include the time axis labels, but these graphs all show a period of about a week. The line I've added to mark the deployment is approximate: we run Alertmanager in an HA cluster and not all nodes are deployed simultaneously. The workload stayed very consistent over this period, and we are confident that almost all of the performance changes shown are a result of this patch.

This change had the most impact on our P95 observations. As an example, 95th percentile GET /alerts latency would frequently spike to 5s-10s pre-patch. After the patch, P95 latency was more stable, around 1s with occasional spikes between 2s and 5s. I think this is because lots of simultaneous silence updates and queries cause contention around the Silences lock. The improvements from this patch are big, but limited to specific workloads which have frequent in-place silence updates. We actually have some more patches that I'm working to untangle which provide more general silence query performance improvements, but this bugfix is a prerequisite for those.

[Attached graphs: 95thPercentilePostAlerts, 95thPercentileGetAlerts, 95thPercentileSilencesQueryDuration, AveragePostAlertsDuration]

Spaceman1701 • Aug 20 '24 21:08

frequent in-place silence updates. We actually have some more patches that I'm working to untangle which provide more general silence query performance improvements, but this bugfix is a prerequisite for those.

Just to learn more about how you are using Alertmanager (it helps us improve the software): from what you've written, it sounds like you do a lot of frequent creates/updates of silences. Is that correct?

There is nothing wrong with that, it's more just an observation: you are the first user I've seen report performance issues due to a high churn rate of silences.

grobinson-grafana • Aug 20 '24 22:08

Just to learn more about how you are using Alertmanager (it helps us improve the software): from what you've written, it sounds like you do a lot of frequent creates/updates of silences. Is that correct?

There is nothing wrong with that, it's more just an observation: you are the first user I've seen report performance issues due to a high churn rate of silences.

Yep, that's correct. We have something similar to https://github.com/prymitive/kthxbye that updates silences using data from external sources. I imagine that anyone using kthxbye or something like it has a similar amount of churn, but I couldn't guess how common that is. I suspect that a part of why we're hitting these performance problems is that we're running at (what I think is) a larger scale than the average Alertmanager deployment. For example, we have tens of thousands of alerts and tens of thousands of silences.

For what it's worth, I believe that this bugfix (and the other improvements we've made to the silence query logic) is beneficial to everyone, even if it has the most impact for users with workloads similar to ours.

Spaceman1701 • Aug 21 '24 20:08

I read through both the silences and marker code again, and I think this change is safe. At least – I couldn't think of any situations where re-using the version would break existing behavior. @beorn7 as the original author of this code, is it possible I've missed something?

grobinson-grafana • Aug 27 '24 15:08

This was so long ago that I don't remember any details like this. Let's go with it and see if it works. :duck:

beorn7 • Aug 27 '24 19:08

Is there anything else I can provide here to help this one move forward?

Spaceman1701 • Sep 10 '24 10:09

I think the ball is in the court of the maintainers. @gotjosh @simonpasquier could you make a call here on whether you want to merge this?

beorn7 • Sep 10 '24 10:09