
[kube-prometheus-stack] Alertmanager does not update secret with custom configuration options

Open everspader opened this issue 3 years ago • 18 comments

Describe the bug

I want to customize the alertmanager configuration in the chart. If the alertmanager.config block is passed in the values file when first installing the chart, the alertmanager pod is not created. But if this block is omitted, the pod is created and I can kubectl port-forward into it.

However, the configuration file is mounted from what seems to be an automatically generated secret which contains the config block from the chart's default values:

...
  volumeMounts:
    - mountPath: /etc/alertmanager/config
      name: config-volume
...
volumes:
  - name: config-volume
    secret:
      defaultMode: 420
      secretName: alertmanager-prometheus-community-kube-alertmanager-generated

Next, if I try to upgrade the chart to include the configuration, a new secret named alertmanager-prometheus-community-kube-alertmanager is created, but the pod still mounts its configuration from the auto-generated secret.
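
To see what the chart-managed secret actually contains versus what the pod mounts, the secret can be decoded directly. A minimal sketch, assuming the release name prometheus-community and namespace monitoring used in this report; the data key alertmanager.yaml matches the yq command used later in this thread:

# Decode the chart-managed secret (what helm rendered from the values file)
kubectl get secret -n monitoring alertmanager-prometheus-community-kube-alertmanager \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d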

What's your helm version?

version.BuildInfo{Version:"v3.7.1", GitCommit:"1d11fcb5d3f3bf00dbe6fe31b8412839a96b3dc4", GitTreeState:"clean", GoVersion:"go1.16.9"}

What's your kubectl version?

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:38:05Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.9", GitCommit:"56709e92afa973c26fad3d4a44723fefa51481b7", GitTreeState:"clean", BuildDate:"2022-03-10T07:59:33Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}

Which chart?

kube-prometheus-stack

What's the chart version?

kube-prometheus-stack-34.9.1

What happened?

The alertmanager pod mounts its configuration file from a secret that contains the chart's default configuration values. When the chart is upgraded, a new secret is created, but it is not mounted into the pod.

What you expected to happen?

The secret from which the configuration file /etc/alertmanager/config/alertmanager.yaml is mounted should be updated to contain the configuration passed in the values file when upgrading the chart.

How to reproduce it?

  1. Install the chart with helm like you normally would (with or without the values file)
  2. Check the auto generated secret: kubectl get secrets -n monitoring
  3. Upgrade the helm chart with the command below to include the configuration block from the values file
  4. Check the secrets again to see that a new secret is created but nothing is changed
  5. Open a shell session in the alertmanager pod to inspect the config file in /etc/alertmanager/config/alertmanager.yaml and see that it still contains the default values (or use the sketch below).
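
Step 5 can also be done without a shell session. A sketch, where the pod name alertmanager-prometheus-community-kube-alertmanager-0 is an assumption based on the default statefulset naming:

# Print the config actually mounted in the pod
kubectl exec -n monitoring alertmanager-prometheus-community-kube-alertmanager-0 \
  -- cat /etc/alertmanager/config/alertmanager.yaml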

Enter the changed values of values.yaml?

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job', 'alertname', 'priority']
      group_wait: 10s
      group_interval: 1m
      routes:
      - match:
          alertname: Watchdog
        receiver: 'null'
      - receiver: 'slack-notifications'
        continue: true
    receivers:
    - name: 'slack-notifications'
      slack-configs:
      - slack_api_url: <url here>
        title: '{{ .Status }} ({{ .Alerts.Firing | len }}): {{ .GroupLabels.SortedPairs.Values | join " " }}'
        text: '<!channel> {{ .CommonAnnotations.summary }}'
        channel: '#mychannel'

Enter the command that you execute that is failing/misfunctioning.

helm upgrade -i prometheus-community prometheus-community/kube-prometheus-stack -n monitoring -f path/to/values.yaml

Anything else we need to know?

No response

everspader avatar Apr 19 '22 13:04 everspader

+1

cydergoth avatar Apr 25 '22 20:04 cydergoth

+1

kevinat avatar Apr 27 '22 09:04 kevinat

+1

Clusiv avatar May 13 '22 07:05 Clusiv

+1

vladimirshikhov avatar May 13 '22 08:05 vladimirshikhov

Please tell me which version of the chart works.

vladimirshikhov avatar May 13 '22 08:05 vladimirshikhov

+1

sinhblue avatar May 24 '22 15:05 sinhblue

This is blocking our production deployment as we can't deploy the alerts we need

cydergoth avatar May 25 '22 18:05 cydergoth

I have found the problem to be a misconfiguration in the alerting rules: the config.yaml can't load properly, so the alerts are not created. My recommendation is to carefully review the config file.

everspader avatar May 25 '22 19:05 everspader

You nailed it - helm was silently mangling the files

#!/bin/bash

helm template -n monitoring monitoring -f values.yaml -f values-ops.yaml . \
  --show-only charts/kube-prometheus-stack/templates/alertmanager/secret.yaml \
  | yq -r '.data."alertmanager.yaml" | @base64d' > am-ops.config
helm template -n monitoring monitoring -f values.yaml -f values-prod.yaml \
  -f secret://values-prod-secrets.enc.yaml . \
  --show-only charts/kube-prometheus-stack/templates/alertmanager/secret.yaml \
  | yq -r '.data."alertmanager.yaml" | @base64d' > am-prod.config
diff am-prod.config am-ops.config
~/go/bin/amtool config routes test --config.file=am-prod.config --tree
~/go/bin/amtool config routes test --config.file=am-prod.config --tree alertname="TSDB Sync failed: missing WAL file"
~/go/bin/amtool config routes test --config.file=am-ops.config --tree
~/go/bin/amtool config routes test --config.file=am-ops.config --tree alertname="TSDB Sync failed: missing WAL file"


cydergoth avatar May 25 '22 21:05 cydergoth

After spending many hours on this, these are my takeaways:

  1. Updating an existing helm deployment will not update the configuration of the alertmanager. You have to uninstall and reinstall the helm chart.
  2. A faulty alertmanager.config field means the alertmanager pod will not be created at all. Start with a working config and iterate on that.
  3. You can obtain a working value for alertmanager.config by installing the helm chart without alertmanager.config set, port-forwarding the alertmanager pod (which will be created with a default config), and using the config displayed at http://localhost:9093/#/status as your alertmanager.config. (SSHing into the pod and finding it there also works.)
  4. From this working setup you can iteratively update the config, uninstall the chart and reinstall the chart. 😅

For this I was using kube-prometheus-stack-36.1.0
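
For item 3, the port-forward might look like this. A sketch; the service name is an assumption derived from the secret name quoted earlier in the thread:

# Forward the alertmanager UI to localhost, then open http://localhost:9093/#/status
kubectl port-forward -n monitoring svc/prometheus-community-kube-alertmanager 9093:9093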

maikokuppe avatar Jun 23 '22 13:06 maikokuppe

Still getting this issue on 36.2.0

YuKitsune avatar Jun 26 '22 02:06 YuKitsune

A workaround I've found is to move the config into an AlertmanagerConfig custom resource, and reference that using alertmanager.alertmanagerSpec.alertmanagerConfiguration.

Example:

# receiver-config.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-config
  namespace: monitoring
spec:
  receivers:
    - name: '<receiver name>'
      webhookConfigs:
        - url: '<webhook url>'

  route:
    receiver: '<receiver name>'
# values.yaml
alertmanager:
  alertmanagerSpec:
    alertmanagerConfiguration:
      name: alertmanager-config
...

This seems to work as expected, but now I wonder how alertmanager.config is even meant to be used.
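
For what it's worth, the rough order of operations for this workaround might be the following (a sketch; file names as above, release name assumed from earlier in the thread):

# The AlertmanagerConfig CRD ships with the chart, so install the chart once
# before applying the custom resource, then upgrade so the operator picks it up
kubectl apply -f receiver-config.yaml
helm upgrade -i prometheus-community prometheus-community/kube-prometheus-stack \
  -n monitoring -f values.yaml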

YuKitsune avatar Jun 26 '22 06:06 YuKitsune

I just had the same problem here and possibly found a solution for it. Notice that you use the 'null' receiver in the Watchdog route match, but there is no definition for the null receiver. By adding an empty receiver named 'null', just like the default values do, my config was updated successfully. Here's an example:

alertmanager:
  enabled: true
  config:
    global:
      slack_api_url: <URL>
      resolve_timeout: 5m
    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'slack'
      routes:
      - match:
          alertname: Watchdog
        receiver: 'null'
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
        - channel: '#alerts'
          text: 'https://internal.myorg.net/wiki/alerts/{{ .GroupLabels.app }}/{{ .GroupLabels.alertname }}'

Maybe the real problem here is that there is no error message or feedback telling you that your configuration is invalid or missing some information. It just fails silently.
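
One way to catch this class of silent failure before upgrading is to render the chart's alertmanager secret locally and validate it with amtool, along the lines of the script posted earlier in the thread. A sketch; the --show-only path assumes the chart is templated directly rather than as a subchart, and the output file name is a placeholder:

# Render only the alertmanager secret, decode it, and check it with amtool
helm template prometheus-community prometheus-community/kube-prometheus-stack \
  -n monitoring -f values.yaml \
  --show-only templates/alertmanager/secret.yaml \
  | yq -r '.data."alertmanager.yaml" | @base64d' > rendered-alertmanager.yaml
amtool check-config rendered-alertmanager.yaml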

renatopereira-gc avatar Jun 30 '22 20:06 renatopereira-gc

+1, I wasn't able to find which component is in charge of this secret generation… Having something at the UI level, or a log informing us that the provided configuration is not correct, would be very cool.

davinkevin avatar Jul 16 '22 16:07 davinkevin

Just encountered the issue and apparently it goes something like this: when you create or update an alertmanager config secret, prometheus operator notices it and attempts to validate it using CRDs, specifically alertmanagerconfigs.monitoring.coreos.com.

If validation succeeds, the generated alertmanager secret is updated/created. This is the secret that is actually mounted in the alertmanager pod, not the one you modify with the chart.

If validation fails, prometheus operator writes a console log explaining what it didn't like in your config and does nothing else. Somewhat sneaky for my liking, but oh well.

tldr: If you want to know why your alertmanager secret is not updated, check the prometheus operator logs for errors.
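
Checking those logs can be as simple as the following sketch; the label selector is an assumption based on the chart's default operator labels:

# Tail the operator's logs and look for config validation errors
kubectl logs -n monitoring -l app=kube-prometheus-stack-operator --tail=100 \
  | grep -i alertmanager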

ps: be aware that apparently the alertmanager config CRD is way behind the official documentation and is lacking some fields. For example, telegram_configs was only recently added (in May or June) and the CRD still misses the time_intervals and active_time_intervals objects.

Volkmire avatar Aug 04 '22 12:08 Volkmire

A workaround I've found is to move the config into an AlertmanagerConfig custom resource, and reference that using alertmanager.alertmanagerSpec.alertmanagerConfiguration. […] This seems to work as expected, but now I wonder how alertmanager.config is even meant to be used.

@YuKitsune would you mind sharing your full config? I am getting the following in the logs:

level=warn ts=2022-08-06T16:06:17.533402697Z caller=operator.go:1091 component=alertmanageroperator msg="skipping alertmanagerconfig" error="unable to get secret \"\": resource name may not be empty" alertmanagerconfig=monitoring/alertmanager-config-override namespace=monitoring alertmanager=promstack-alertmanager

This is my AlertmanagerConfig:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-config-override
  namespace: monitoring
spec:
    route:
      groupWait: 30s
      groupInterval: 5m
      repeatInterval: 12h
      receiver: 'null'
      routes:
      - receiver: 'null'
        matchers:
        - name: alertname
          matchType: '=~'
          value: "InfoInhibitor|Watchdog"
    receivers:
    - name: 'null'

and my values.yml:

alertmanager:
  enabled: true
  config:
    global:
      resolve_timeout: 5m
    templates:
    - '/etc/alertmanager/config/*.tmpl'
  alertmanagerSpec:
    alertmanagerConfiguration:
      name: alertmanager-config-override

jsalatiel avatar Aug 06 '22 16:08 jsalatiel

@jsalatiel That should work. What I've found is that you need to install the chart first so that the CRDs get added, apply the custom AlertmanagerConfig, then re-install/upgrade the helm chart so it picks up the custom AlertmanagerConfig.

I'm still relatively new to Helm and K8s, so I might be doing it in a roundabout way, but that's what I've found works...

YuKitsune avatar Aug 07 '22 03:08 YuKitsune

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] avatar Sep 16 '22 01:09 stale[bot]

This issue is being automatically closed due to inactivity.

stale[bot] avatar Oct 12 '22 10:10 stale[bot]

Had the same issue; it turned out to be fixed by deleting the alertmanager pod, after which all the values from alertmanager.config were applied automatically.
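
For reference, that might look like this (a sketch; the pod name assumes the default statefulset naming from earlier in the thread):

# Delete the pod; the statefulset recreates it with the current secret contents
kubectl delete pod -n monitoring alertmanager-prometheus-community-kube-alertmanager-0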

quasimodo-r avatar Jan 20 '23 12:01 quasimodo-r

Had the same issue; it turned out to be fixed by deleting the alertmanager pod, after which all the values from alertmanager.config were applied automatically.

Can you share your configuration? I deleted the pod and the alertmanager pod's configuration was not updated.

pilchita avatar May 21 '23 23:05 pilchita

Can you share your configuration? I deleted the pod and the alertmanager pod's configuration was not updated.

Unfortunately I don't have it anymore.

rummu666 avatar May 22 '23 05:05 rummu666

Could you please reopen this ticket? I have the same issue today.

yuriifurko avatar Sep 25 '23 09:09 yuriifurko

We're encountering this issue repeatedly. It would be nice if there were a better solution than "uninstall and reinstall the chart", which in production is a tad heavy-handed...

noahlz avatar Nov 09 '23 16:11 noahlz

The core of the issue is: when installing the chart, the secret is created as a pre-hook resource, and is therefore not part of the helm release proper. So if you run upgrade on your chart, it won't try to recreate the secret.

However, if you delete the secret and run upgrade on your chart, the secret will get recreated with the proper values. To be clear, this is quite annoying 😢
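
Spelled out, that delete-and-upgrade workaround might look like this (a sketch; secret and release names are assumed from earlier in the thread):

# Delete the chart-managed secret, then upgrade so helm recreates it from the current values
kubectl delete secret -n monitoring alertmanager-prometheus-community-kube-alertmanager
helm upgrade -i prometheus-community prometheus-community/kube-prometheus-stack \
  -n monitoring -f path/to/values.yaml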

While there are other workarounds (such as using an external secret), I wouldn't consider this issue closed, as it is very much present 😅

SamuZad avatar Nov 16 '23 17:11 SamuZad