alertmanager-alert-* pods fail to start
When deploying GCS, the alertmanager pods fail to start:
```
$ kubectl --kubeconfig=kubeconfig -nmonitoring get po
NAME                                   READY   STATUS              RESTARTS   AGE
alertmanager-alert-0                   0/2     ContainerCreating   0          38m
grafana-78cb8848f6-wgv95               1/1     Running             0          38m
prometheus-operator-6c4b6cfc76-sqt29   1/1     Running             0          39m
prometheus-prometheus-0                3/3     Running             1          38m
prometheus-prometheus-1                3/3     Running             1          38m
```
This seems to be due to a missing secret:
```
$ kubectl --kubeconfig=kubeconfig -nmonitoring describe po/alertmanager-alert-0
...
Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    38m                default-scheduler  Successfully assigned monitoring/alertmanager-alert-0 to kube3
  Warning  FailedMount  8m (x23 over 38m)  kubelet, kube3     MountVolume.SetUp failed for volume "config-volume" : secret "alertmanager-alert" not found
  Warning  FailedMount  2m (x16 over 36m)  kubelet, kube3     Unable to mount volumes for pod "alertmanager-alert-0_monitoring(7ec5b71c-12bd-11e9-b6f3-5254008efbd2)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"alertmanager-alert-0". list of unmounted volumes=[config-volume]. list of unattached volumes=[config-volume alertmanager-alert-db default-token-7hg4n]
```
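The name of the missing Secret can also be extracted from the FailedMount event programmatically, e.g. when scripting a health check. A small sketch (Python; the regex is an assumption about the kubelet's message format, which reports unresolved references as `secret "<name>" not found`):

```python
import re

# Example kubelet event message, copied from the pod description above.
event = ('MountVolume.SetUp failed for volume "config-volume" : '
         'secret "alertmanager-alert" not found')

# Pull the secret name out of the 'secret "<name>" not found' pattern.
match = re.search(r'secret "([^"]+)" not found', event)
missing_secret = match.group(1) if match else None
print(missing_secret)  # alertmanager-alert
```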
Taking the secret from https://raw.githubusercontent.com/coreos/prometheus-operator/master/contrib/kube-prometheus/manifests/alertmanager-secret.yaml and inserting it into the monitoring namespace as alertmanager-alert seems to fix the problem:
```
$ kubectl --kubeconfig=kubeconfig -nmonitoring get po
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-alert-0                   2/2     Running   0          8m
alertmanager-alert-1                   2/2     Running   0          9m
alertmanager-alert-2                   2/2     Running   0          8m
grafana-78cb8848f6-wgv95               1/1     Running   0          48m
prometheus-operator-6c4b6cfc76-sqt29   1/1     Running   0          48m
prometheus-prometheus-0                3/3     Running   1          48m
prometheus-prometheus-1                3/3     Running   1          47m
```
We probably shouldn't be blindly applying Secrets as I have done here... there's supposed to be a secret, after all :lock:. We should probably be generating our own and applying it in the proper places.
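Generating our own Secret could be sketched roughly like this (Python; the manifest layout mirrors the upstream alertmanager-secret.yaml, and the `alertmanager.yaml` data key is the one prometheus-operator expects, but treat the details as assumptions to verify against the operator docs):

```python
import base64

# A minimal placeholder Alertmanager config that lets the pods start.
config = """\
global:
  resolve_timeout: 5m
route:
  receiver: "null"
receivers:
- name: "null"
"""

# Kubernetes Secrets carry their payload base64-encoded under .data.
encoded = base64.b64encode(config.encode()).decode()

manifest = f"""\
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-alert
  namespace: monitoring
type: Opaque
data:
  alertmanager.yaml: {encoded}
"""
print(manifest)
```

The resulting manifest could then be piped to `kubectl apply -f -` instead of pulling an unreviewed Secret off the internet.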
"global":
"resolve_timeout": "5m"
"receivers":
- "name": "null"
"route":
"group_by":
- "job"
"group_interval": "5m"
"group_wait": "30s"
"receiver": "null"
"repeat_interval": "12h"
"routes":
- "match":
"alertname": "DeadMansSwitch"
"receiver": "null"
This is the Base64-decoded content of the Secret that @JohnStrunk applied. Applying this Secret fixes alertmanager-alert being stuck in the ContainerCreating state, but it still doesn't configure Alertmanager to do anything meaningful. Ideally, the Secret we apply should contain a real Alertmanager configuration.
```yaml
global:
  resolve_timeout: 5m
route:
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2m
  receiver: 'slack'
receivers:
- name: 'slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/...'
    channel: test-alerts
```
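One way to use a config like this without committing the webhook URL to the repo is to template it in at deploy time. A sketch (Python; the `SLACK_WEBHOOK_URL` and `SLACK_CHANNEL` environment variable names are my own invention):

```python
import os
import string

# Template of the minimal Slack config above; secrets are injected
# from the environment rather than stored in version control.
TEMPLATE = string.Template("""\
global:
  resolve_timeout: 5m
route:
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2m
  receiver: 'slack'
receivers:
- name: 'slack'
  slack_configs:
  - api_url: '$webhook_url'
    channel: $channel
""")

# Fall back to placeholders so the rendered config is still valid YAML.
config = TEMPLATE.substitute(
    webhook_url=os.environ.get("SLACK_WEBHOOK_URL",
                               "https://hooks.slack.com/services/..."),
    channel=os.environ.get("SLACK_CHANNEL", "test-alerts"),
)
print(config)
```

The rendered text would then be base64-encoded into the alertmanager-alert Secret as above.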
This is the minimal configuration that I use to send alerts to the test-alerts channel on redhat.slack. That being said, there are many ways in which Alertmanager can be configured, and how we generate the Secret depends on that. See https://github.com/prometheus/alertmanager/blob/master/doc/examples/simple.yml and https://prometheus.io/docs/alerting/configuration/#configuration-file.
So, if we provide our own Secret, how do we configure it in a way that covers common use cases?
I'm not sure we can do better than providing:
- a placeholder config that allows the system to start
- documentation on how to configure it properly
Alert configurations are very environment dependent, and I'm not sure we can even come up w/ a reasonable "demo" setup.
As provided by prometheus-operator [1], the mixins can hold the config and documentation for now in the extras directory. Going forward, the configuration will be handled by the Prometheus Operator in OKD, since the monitoring stack is managed by it.
> Alert configurations are very environment dependent, and I'm not sure we can even come up w/ a reasonable "demo" setup.
I think that should be possible if we add the config to the mixins and build and apply it via a container. @JohnStrunk thoughts?

[1] https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus#alertmanager-configuration
I'm not concerned about how to provide the file; it's what the file should contain. From briefly looking at the config, it seems concerned with where to push alert notifications. That's the part I don't think we have a generic answer for. The demo environment doesn't have any "receivers". Am I misunderstanding the problem?
Agree with @JohnStrunk. A default Secret plus documentation on how to configure it is the best way to go.