prometheus-operator

[Feature idea/request] Create alertmanager silences via CRD

Open dharmab opened this issue 6 years ago • 23 comments

It would be nice to create silences in alertmanager via a custom resource definition. Some use cases:

  • I am a cluster operator who needs to upgrade the etcd cluster. I want to create a silence prior to taking down each etcd node and remove the silence after I have finished upgrading the node.
  • I am a developer who needs to update a StatefulSet. I want to create a silence for the duration of the StatefulSet change and remove it when the change is complete.

Is this a good and/or feasible idea? CRDs are easier to deal with than the alertmanager API.

dharmab avatar Feb 12 '19 22:02 dharmab

What would be wrong about talking to the Alertmanager API directly? :thinking: I don't think we need a CRD for this.

/cc @mxinden as he is one of the Alertmanager maintainers
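
For comparison, here is a minimal sketch of what talking to the Alertmanager API directly could look like for the upgrade use cases above. It assumes the v2 HTTP API (`POST /api/v2/silences` and `DELETE /api/v2/silence/{silenceID}`) is reachable without extra authentication. `ALERTMANAGER_URL`, `build_silence`, and `silenced` are illustrative names, not part of any existing tool.

```python
import json
import urllib.request
from contextlib import contextmanager
from datetime import datetime, timedelta, timezone

# Assumption: an Alertmanager reachable at this address, with no auth in front.
ALERTMANAGER_URL = "http://localhost:9093"


def build_silence(matchers, duration, created_by, comment, now=None):
    """Build the JSON body for POST /api/v2/silences from exact-match labels."""
    start = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": (start + duration).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def _call(method, path, body=None):
    """Send one JSON request to Alertmanager and decode the reply, if any."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        ALERTMANAGER_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
        return json.loads(raw) if raw else None


@contextmanager
def silenced(matchers, duration, created_by, comment):
    """Create a silence on entry and expire it on exit, even if the body fails."""
    body = build_silence(matchers, duration, created_by, comment)
    silence_id = _call("POST", "/api/v2/silences", body)["silenceID"]
    try:
        yield silence_id
    finally:
        _call("DELETE", f"/api/v2/silence/{silence_id}")
```

A maintenance script could then wrap each node upgrade in something like `with silenced({"job": "etcd"}, timedelta(hours=1), "ops", "etcd upgrade"): ...`, where the `job` label and its value are placeholders for whatever labels your alerts carry. This works, but it still needs network access and credentials for Alertmanager, which is the pain point raised in this issue.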

metalmatze avatar Feb 15 '19 14:02 metalmatze

Using the apiserver is nice because we already have it connected to our identity provider (in my case, Active Directory). So users can use their existing identity, which already uses 2FA and is visible in audit logs. For machine accounts, we don't need to create separate machine credentials for Alertmanager, as we can re-use a ServiceAccount.

dharmab avatar Feb 15 '19 20:02 dharmab

I don't think a CRD is necessary for auditing/auth. Just put kube-rbac-proxy in front of Alertmanager and create an artificial resource; then you can already use the auditing/auth features.

brancz avatar Feb 18 '19 15:02 brancz

As an additional data point, I also think this would be a nice feature. :)

Having a CRD would integrate nicely into the whole Kubernetes ecosystem, e.g. give the possibility to create silences with Helm, quickly list them with kubectl, etc.

It also lowers the overall complexity of the system from an operator point of view, by having one interface to interact with the system (k8s) instead of two (k8s + alertmanager api).

This would also be conceptually identical to the way the prometheus-operator defines Prometheus rules as CRDs, which we've found very easy and convenient to work with here at $work.

Pluies avatar May 29 '19 10:05 Pluies

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

stale[bot] avatar Aug 14 '19 00:08 stale[bot]

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

stale[bot] avatar Oct 20 '19 08:10 stale[bot]

I'm not sure they're identical to Kubernetes APIs, as silences expire and have lots of lifecycle behavior of their own. While I understand the desire, I'm not certain it fits into the "Kubernetes API"-box.

brancz avatar Oct 21 '19 09:10 brancz

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

stale[bot] avatar Dec 20 '19 10:12 stale[bot]

I also think it would simplify the way we manage AM. Of course we can use the API or amtool, but as a Kubernetes user, it is nicer to create the silence through a CRD just by using kubectl or the OpenShift client.

It might also fit the lifecycle of a k8s object a bit better if the silence had an infinite end date. That way it wouldn't expire until the CR is gone.

If it's not possible to have this feature in AM (which I can understand), I don't see a big issue with doing it on top of the current AM behavior. I mean the prometheus-operator could read the CRs, push them to AM through the API, and then regularly query AM to check whether the silences created through the CRs are gone or not.

If the silences are gone but not expired (because the whole cluster restarted, for example), then the prometheus-operator would have to call the API again to re-create the silence. If the silences have expired, then the CRs are deleted.

In the end it's just a matter of being able to synchronize two databases.
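
To make the idea concrete, here is a rough sketch of the decision logic for one sync pass, kept free of any Kubernetes or Alertmanager client code. `DesiredSilence` and `plan` are hypothetical names, and the sketch assumes the operator can map each active Alertmanager silence back to the name of the CR that created it (for example by recording that name in the silence comment).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DesiredSilence:
    """Hypothetical view of one silence custom resource."""
    name: str          # name of the custom resource
    ends_at: datetime  # requested end of the silence


def plan(desired, active_names, now):
    """Decide what one sync pass should do.

    desired:      silences declared through custom resources
    active_names: names of CRs that currently have an active silence in AM
    Returns (to_create, to_delete): silences to push to AM again, and
    custom resources to delete because their silence has expired.
    """
    to_create, to_delete = [], []
    for silence in desired:
        if silence.ends_at <= now:
            # Expired: the custom resource is no longer needed.
            to_delete.append(silence.name)
        elif silence.name not in active_names:
            # Gone from AM but not expired (e.g. after a restart): re-create it.
            to_create.append(silence.name)
    return to_create, to_delete
```

The operator would run `plan` periodically, then call the AM API for every name in `to_create` and delete the custom resources listed in `to_delete`. Silences that are still active and not expired need no action.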

Nexucis avatar Oct 20 '20 12:10 Nexucis

@brancz does what I said above make sense? If by any chance it's 'ok', I would be glad to work on it if you want.

Nexucis avatar Oct 27 '20 14:10 Nexucis

@Nexucis At Giant Swarm we're managing lots of Alertmanagers. We have this issue of silencing different kinds of alerts and managing a history of changes. For now, we've solved it with https://github.com/giantswarm/silence-operator . It is a bit specific to our needs, but even if you don't need to sync a git repo of silences into your k8s clusters (a kind of GitOps approach), you can use this operator with minimal CRs like the following (matchers correspond to silence matchers in Alertmanager):

apiVersion: monitoring.giantswarm.io/v1alpha1
kind: Silence
metadata:
  name: test-silence
spec:
  targetTags: []
  matchers:
  - name: cluster
    value: test
    isRegex: false

There is no expiry date. As long as the CR exists, the silence exists.

corest avatar Nov 09 '20 23:11 corest

@corest very cool! I will share that with my team!

dharmab avatar Nov 09 '20 23:11 dharmab

@brancz would it make sense to work on documentation that explains all the possible use cases and see whether this is worth implementing in prometheus-operator?

QuentinBisson avatar Feb 23 '21 09:02 QuentinBisson

Yeah I think if the use cases are sound we can talk about it! So far I'm not convinced that permanent silences aren't better off as being inhibition rules or routes that blackhole the alerts, both of which can be specified via the AlertmanagerConfig CRD already. Even if that's already possible though, we should document it, so having the use cases definitely helps either way!

brancz avatar Feb 23 '21 13:02 brancz

One of the main use cases we have at Giant Swarm is to be able to apply the same silence across multiple clusters in a GitOps fashion (defined as desired state so it is easier to track them). Our silence operator, as explained by @corest, renders the Silence CRs and applies them to Alertmanager. Them not expiring is mostly an implementation detail.

After thinking this through, maybe it would make sense to define inhibitions as first class citizens (i.e. CRD) instead of as another prometheus rule? In the end, this would become rules, but having a way to define inhibitions could also make sense. This CR could for example define on which rules or labels it should be applied instead of applying it on all the rules it should inhibit?

QuentinBisson avatar Mar 29 '21 10:03 QuentinBisson

Yeah I think this makes sense. For what it's worth, we are already able to specify inhibition rules via the new AlertmanagerConfig CRD.

brancz avatar Apr 15 '21 12:04 brancz

@corest, I ran into an issue when I tried to apply the Silence CR from your comment above: unable to recognize "STDIN": no matches for kind "Silence" in version "monitoring.giantswarm.io/v1alpha1". Could you let me know what might cause this error?

ghazal-naderi avatar Apr 19 '21 09:04 ghazal-naderi

Sharing a possible use case for silence CRD here:

We have built a platform to manage multiple k8s clusters, and we want to integrate alerting into it. It would be helpful to have a CRD so that we can talk to the apiserver directly rather than having to connect to every Alertmanager in all managed clusters, which requires extra ingresses and service-discovery configs.

just1900 avatar Jun 20 '21 16:06 just1900

Managing silences via the Kubernetes API would also be beneficial for GitOps cluster management. @brancz made a good point about silences having a non-trivial lifecycle with expiration etc. A silence CRD would be one way of handling this, at least in those situations where the silence is somehow permanent, e.g. when a contextually irrelevant alerting rule cannot be disabled (e.g. OpenShift cluster monitoring operator).

geoberle avatar Jun 23 '21 17:06 geoberle

Any status update on this? Was this issue just forgotten? 😁

davidpanic avatar Dec 27 '23 12:12 davidpanic

FYI: Looks like something is emerging in the community https://github.com/jacksgt/alert-operator 🤗

bwplotka avatar Jul 29 '24 21:07 bwplotka

From what I see, it doesn't support create/update operations (yet?). I suppose that https://github.com/giantswarm/silence-operator is more accomplished (it was mentioned in the first comments).

simonpasquier avatar Jul 30 '24 09:07 simonpasquier

And to be clear, we have a consensus that it would be useful but someone needs to work on it (see #5485).

simonpasquier avatar Jul 30 '24 09:07 simonpasquier