
Persisting silences in alertmanager

marwanad opened this issue Nov 28 '23 · 9 comments

In the managed Alertmanager, the alertmanager-data volume is an emptyDir, which means that configured silences and notification states won't persist across pod restarts. Is there a way to have a configurable PVC for the data dir with the managed Alertmanager?
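For context, the relevant stanza looks roughly like the sketch below. This is illustrative only, not the exact manifest the GMP operator deploys; the namespace and mount path in particular are assumptions:

```yaml
# Illustrative sketch -- not the exact manifest the GMP operator deploys.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
  namespace: gmp-system      # assumption: where managed GMP components run
spec:
  template:
    spec:
      containers:
      - name: alertmanager
        volumeMounts:
        - name: alertmanager-data
          mountPath: /alertmanager   # data dir (path illustrative); holds silences + notification log
      volumes:
      - name: alertmanager-data
        emptyDir: {}                 # wiped whenever the pod is restarted or rescheduled
```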

marwanad commented Nov 28 '23

That's correct, thanks for raising this.

Alertmanager runs as a StatefulSet, but with a best-effort emptyDir volume, which does not guarantee any persistence. With self-deployment you can modify the Alertmanager resources yourself to add persistence, but that's not possible in managed GMP.

We could discuss this feature as a team if you want; it feels like something we could consider, but at a lower priority. Help is also wanted to contribute this feature, which might get it done faster.

Just curious, what's your use case for the managed Alertmanager? Would our recent preview feature, PromQL for Cloud Monitoring alerting, help?

bwplotka commented Nov 29 '23

@bwplotka thanks for the response! At the time there was no way to disable the deployment of the managed Alertmanager through the GMP operator, so we ended up using it instead of having duplicate deployments.

So it's basically the same use case as for an unmanaged Alertmanager: at the time we couldn't define PromQL rules in Cloud Monitoring, and we needed more control over the notification channel configs for Slack, PagerDuty, etc. The preview feature looks interesting and covers a subset of our use case, but we'll still need Alertmanager for generic webhook channels.

marwanad commented Nov 29 '23

Note that Cloud Alerting PromQL does support generic webhook channels: https://cloud.google.com/monitoring/support/notification-options#webhooks

lyanco commented Nov 30 '23

We are facing the same problem. All of our silences are gone on pod restart, and we need to recreate all of them manually. In the last two weeks this happened twice. So this improvement would be very helpful for us as well!

taldejoh commented Dec 19 '23

Sorry for the lag; it's on our radar again, and we are brainstorming how to enable persistent volumes here.

Interestingly, there is a very nasty "persistent" workaround for silences in the meantime: https://github.com/prometheus/alertmanager/issues/1673#issuecomment-819421068 (thanks @TheSpiritXIII for the find!)
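In a similar spirit, here is a rough sketch (not a quote of that linked workaround) of snapshotting silences around restarts with amtool's `silence query` and `silence import` subcommands. The URL and the port-forward target are assumptions:

```sh
# Assumes the Alertmanager API is reachable locally, e.g. via:
#   kubectl -n gmp-system port-forward svc/alertmanager 9093
AM_URL=http://localhost:9093

# Export all active silences to JSON before the pod goes away...
amtool silence query -o json --alertmanager.url="$AM_URL" > silences.json

# ...and re-create them from the dump after the emptyDir has been wiped.
amtool silence import --alertmanager.url="$AM_URL" silences.json
```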

bwplotka commented Mar 28 '24

Just a quick question for users who care about this feature: which managed collection (this operator) deployment model do you use?

1️⃣ The one available on GKE (fully managed). If that's the case, how do you submit the silences?
2️⃣ A self-deployed operator (via kubectl). If that's the case, what stops you from manually adjusting the Alertmanager StatefulSet YAML for your needs and re-applying it (see the sketch below)? The operator will manage that one just fine, as long as you keep the labels, namespace, and name the same.
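For 2️⃣, a minimal sketch of such an adjustment, assuming a pre-created PVC (name, namespace, and size are illustrative):

```yaml
# Illustrative sketch: pre-create a PVC for the Alertmanager data dir.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager-data
  namespace: gmp-system   # assumption: namespace of the Alertmanager StatefulSet
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
```

Then, in the StatefulSet's `spec.template.spec.volumes`, the `emptyDir: {}` entry for alertmanager-data would be replaced with `persistentVolumeClaim: {claimName: alertmanager-data}`. Since the pod template (unlike `volumeClaimTemplates`) is mutable, this doesn't require recreating the StatefulSet; note that with ReadWriteOnce access this only works cleanly for a single replica.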

cc @m3adow @marwanad @taldejoh

bwplotka commented Apr 8 '24

@bwplotka appreciate the updates on this :)

We were using option 1, setting the silences by port-forwarding to the running Alertmanager instance and then adding them through the UI or submitting them with amtool.
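For reference, that workflow looks roughly like the following; the service name and namespace are assumed to be the managed defaults, and the matcher and comment are just examples:

```sh
# Make the managed Alertmanager reachable locally.
kubectl -n gmp-system port-forward svc/alertmanager 9093 &

# Silence a specific alert for two hours via amtool.
amtool silence add --alertmanager.url=http://localhost:9093 \
  --duration=2h --comment="maintenance window" alertname=ExampleAlert
```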

We've since switched to a self-deployed Alertmanager instance to get more control over this, and set the alertmanagers field in the operator config to point to our self-managed instance.
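For anyone finding this later, that wiring looks roughly like the sketch below; the target service name, namespace, and port are placeholders, so double-check the field names against the current OperatorConfig CRD:

```yaml
# Rough sketch of pointing the GMP rule-evaluator at a self-managed Alertmanager.
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
rules:
  alerting:
    alertmanagers:
    - name: my-alertmanager   # placeholder: your Alertmanager Service
      namespace: monitoring   # placeholder: its namespace
      port: 9093
```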

marwanad commented Apr 8 '24

We're using option 1 as well. We're currently in the process of migrating from kube-prometheus-stack to GMP, and we want to have as much of the "GM" as possible. 😄
Right now we're also using port-forwarding and the UI to silence alerts. Since the alerts are sent to Teams channels, we have no option to silence them later in the alerting chain.

m3adow commented Apr 9 '24

Epic, thanks for the clarifications!

bwplotka commented Apr 9 '24