
alertmanager-alert-* pods fail to start

Open JohnStrunk opened this issue 6 years ago • 5 comments

When deploying GCS, the alertmanager pods fail to start:

$ kubectl --kubeconfig=kubeconfig -nmonitoring get po
NAME                                   READY     STATUS              RESTARTS   AGE
alertmanager-alert-0                   0/2       ContainerCreating   0          38m
grafana-78cb8848f6-wgv95               1/1       Running             0          38m
prometheus-operator-6c4b6cfc76-sqt29   1/1       Running             0          39m
prometheus-prometheus-0                3/3       Running             1          38m
prometheus-prometheus-1                3/3       Running             1          38m

This seems to be due to a missing secret:

$ kubectl --kubeconfig=kubeconfig -nmonitoring describe po/alertmanager-alert-0
...
Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    38m                default-scheduler  Successfully assigned monitoring/alertmanager-alert-0 to kube3
  Warning  FailedMount  8m (x23 over 38m)  kubelet, kube3     MountVolume.SetUp failed for volume "config-volume" : secret "alertmanager-alert" not found
  Warning  FailedMount  2m (x16 over 36m)  kubelet, kube3     Unable to mount volumes for pod "alertmanager-alert-0_monitoring(7ec5b71c-12bd-11e9-b6f3-5254008efbd2)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"alertmanager-alert-0". list of unmounted volumes=[config-volume]. list of unattached volumes=[config-volume alertmanager-alert-db default-token-7hg4n]

Taking the secret from here: https://raw.githubusercontent.com/coreos/prometheus-operator/master/contrib/kube-prometheus/manifests/alertmanager-secret.yaml and inserting it into the monitoring namespace as alertmanager-alert seems to fix the problem:

$ kubectl --kubeconfig=kubeconfig -nmonitoring get po
NAME                                   READY     STATUS    RESTARTS   AGE
alertmanager-alert-0                   2/2       Running   0          8m
alertmanager-alert-1                   2/2       Running   0          9m
alertmanager-alert-2                   2/2       Running   0          8m
grafana-78cb8848f6-wgv95               1/1       Running   0          48m
prometheus-operator-6c4b6cfc76-sqt29   1/1       Running   0          48m
prometheus-prometheus-0                3/3       Running   1          48m
prometheus-prometheus-1                3/3       Running   1          47m

We probably shouldn't be blindly applying Secrets as I have done here... It's supposed to be a secret, after all :lock:. We should probably generate our own and apply it in the proper places.

JohnStrunk avatar Jan 07 '19 21:01 JohnStrunk

"global": 
  "resolve_timeout": "5m"
"receivers": 
- "name": "null"
"route": 
  "group_by": 
  - "job"
  "group_interval": "5m"
  "group_wait": "30s"
  "receiver": "null"
  "repeat_interval": "12h"
  "routes": 
  - "match": 
      "alertname": "DeadMansSwitch"
    "receiver": "null"

This is the Base64 decode of the Secret that @JohnStrunk applied.
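For reference, Kubernetes Secrets store their payload base64-encoded under `.data`, so the plain-text config above can be recovered with a decode step. A minimal sketch (the `alertmanager.yaml` key name is an assumption based on the usual prometheus-operator Secret layout, and `decode_secret_value` is a hypothetical helper):

```python
import base64

def decode_secret_value(encoded: str) -> str:
    """Decode one value from a Secret's .data map.

    Secret data values are base64-encoded; decoding the value stored
    under the alertmanager.yaml key (key name assumed) recovers the
    plain-text Alertmanager config shown above.
    """
    return base64.b64decode(encoded).decode("utf-8")

# Round trip with a fragment of the decoded config:
fragment = '"global":\n  "resolve_timeout": "5m"\n'
encoded = base64.b64encode(fragment.encode("utf-8")).decode("ascii")
print(decode_secret_value(encoded) == fragment)
```

The encoded value itself can be pulled out of the cluster with kubectl's jsonpath output and fed to a decoder like this one.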

Applying this secret fixes the issue of alertmanager-alert being stuck in the ContainerCreating state, but it still doesn't configure the alertmanager to do anything meaningful.

Ideally, the secret applied should contain information about the alertmanager configuration.

global:
  resolve_timeout: 5m

route:
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2m
  receiver: 'slack'

receivers:
- name: 'slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/...'
    channel: test-alerts

This is the minimal configuration that I use to send alerts to the test-alerts channel on redhat.slack.

That said, there are many ways in which it can be configured, and generating the secret depends on that. https://github.com/prometheus/alertmanager/blob/master/doc/examples/simple.yml https://prometheus.io/docs/alerting/configuration/#configuration-file

So, if we provide our own secret, how do we configure it in a way that is applicable for common use cases?
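One way to script that: a hedged sketch of wrapping whatever config we settle on into the Secret manifest the operator expects. The Secret name (alertmanager-alert) and key (alertmanager.yaml) are taken from the mount error in the pod events above; the helper name is hypothetical.

```python
import base64

def make_alertmanager_secret(config: str, namespace: str = "monitoring") -> str:
    """Wrap an Alertmanager config into a Secret manifest.

    Secret data must be base64-encoded; the Secret name and key match
    what the operator tries to mount into the alertmanager pod.
    """
    encoded = base64.b64encode(config.encode("utf-8")).decode("ascii")
    return (
        "apiVersion: v1\n"
        "kind: Secret\n"
        "metadata:\n"
        "  name: alertmanager-alert\n"
        f"  namespace: {namespace}\n"
        "data:\n"
        f"  alertmanager.yaml: {encoded}\n"
    )

# Example: a placeholder config just sufficient to let the pod start,
# with a null receiver that would be swapped for a real one.
placeholder = (
    "global:\n"
    "  resolve_timeout: 5m\n"
    "route:\n"
    '  receiver: "null"\n'
    "receivers:\n"
    '- name: "null"\n'
)
print(make_alertmanager_secret(placeholder))
```

The resulting manifest could be piped to `kubectl apply -f -`, with the receivers block replaced by whatever the environment actually uses (Slack, email, PagerDuty, ...).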

umangachapagain avatar Jan 08 '19 05:01 umangachapagain

I'm not sure we can do better than providing:

  • a placeholder config that allows the system to start
  • documentation on how to configure it properly

Alert configurations are very environment dependent, and I'm not sure we can even come up w/ a reasonable "demo" setup.

JohnStrunk avatar Jan 14 '19 16:01 JohnStrunk

As provided by prometheus-operator[1], the mixins can hold the config and documentation for now in the extras directory.

Moving forward, the configurations are being handled by the Prometheus operator in OKD, because the monitoring stack is handled by them.

Alert configurations are very environment dependent, and I'm not sure we can even come up w/ a reasonable "demo" setup.

I think that should be possible if we add the config to the mixins and build and apply it via a container.

@JohnStrunk thoughts? [1]. https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus#alertmanager-configuration

cloudbehl avatar Jan 14 '19 22:01 cloudbehl

I'm not concerned about how to provide the file; it's what the file should contain. From briefly looking at the config, it seems concerned with where to push alert notifications. That's the part that I don't think we have a generic answer for. The demo environment doesn't have any "receivers". Am I misunderstanding the problem?

JohnStrunk avatar Jan 15 '19 18:01 JohnStrunk

Agree with @JohnStrunk. A default Secret plus documentation on how to configure it is the best way to go.

umangachapagain avatar Jan 17 '19 05:01 umangachapagain