Alertmanager template changes are not fully reloaded

Open locmai opened this issue 1 year ago • 11 comments

Describe the bug

We run Cortex from the Helm chart in our Kubernetes cluster, with a sidecar in the alertmanager pod that continuously watches a configmap for changes and synchronizes the templates into our /data/fake/templates directory. The template files are updated, but the changes are not fully reflected in the messages.

To Reproduce

Steps to reproduce the behavior (note: fake is the dummy tenant name):

  1. Start a minimal Cortex with alertmanager and sidecar
  2. Define a template file example.gotmpl (example in additional context)
  3. Let the alertmanager reload the configuration (log message: https://github.com/cortexproject/cortex/blob/9bc04ce3930b045480d72ab9712d3271c70c02ee/pkg/alertmanager/multitenant.go#L689)
  4. Then change the template in the configmap (the changeme part)
  5. Let the alertmanager reload the configuration again (same log as step 3)
  6. Check the file `/data/fake/templates/example.gotmpl` - the file will have the changes
  7. Use amtool to simulate an alert
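
As an alternative to amtool in step 7, a test alert can be sketched against the Alertmanager v2 HTTP API. This is a minimal sketch, not the exact command used in this report: the base URL (host, port, and the /api/prom/alertmanager prefix) and the helper names build_alert, build_request, and send are assumptions for illustration. The X-Scope-OrgID header carries the tenant name (fake here).

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_alert(alertname, minutes=5, **labels):
    """Build a one-element alert list in the Alertmanager v2 API format."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, **labels},
        "annotations": {"summary": "template reload test"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def build_request(base_url, tenant, alerts):
    """Prepare a POST to <base_url>/api/v2/alerts for the given tenant."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Scope-OrgID": tenant,  # Cortex tenant header
        },
        method="POST",
    )


def send(request, timeout=10):
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Usage, assuming the alertmanager is reachable on localhost:8080 (hypothetical): `send(build_request("http://localhost:8080/api/prom/alertmanager", "fake", build_alert("my_alert", severity="warning")))`.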

Expected behavior

The changes should be reflected in the simulated alert sent by amtool.

Actual behavior

The old template is still being used

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

Additional Context

Alertmanager configuration:

receivers:
  - name: 'team-1'
    slack_configs:
      - channel: '#team1'
        send_resolved: true
        title: '{{ template "__alert_title" . }}'
        text: |-
          Title :{{ template "__alert_title" . }}
templates:
  - 'example.gotmpl'

example.gotmpl :

{{ define "__alert_title" -}}
   {{ .CommonLabels.alertname }} - changeme
{{- end }}

We tried calling the /api/v1/alerts endpoint, which returns the updated template, and the log message indicates that loadAndSyncConfigs actually ran.

I've traced through the function from loadAndSyncConfigs -> setConfig where: https://github.com/cortexproject/cortex/blob/9bc04ce3930b045480d72ab9712d3271c70c02ee/pkg/alertmanager/multitenant.go#L861C3-L861C68

This line seems to compare the templates from the loaded/updated cfg with the template file in the store (at templateFilePath), which the sidecar's mechanism has already updated.
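
To make the suspicion concrete, here is a minimal model of it in Python. It is not Cortex's code: store_template_if_changed and needs_reload are hypothetical names, and the sketch only assumes that a reload is triggered when the on-disk template differs from the newly loaded one. Under that assumption, a sidecar that writes the new content into the same directory first would make the comparison report "no change".

```python
import tempfile
from pathlib import Path


def store_template_if_changed(path, content):
    """Write content to path; return True only if the file content changed."""
    if path.exists() and path.read_text() == content:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    return True


def needs_reload(template_dir, templates):
    """Return True if any template differs from what is already on disk."""
    changed = False
    for name, content in templates.items():
        if store_template_if_changed(template_dir / name, content):
            changed = True
    return changed


def demo():
    """Return (first_load_changed, load_after_sidecar_write_changed)."""
    old = '{{ define "__alert_title" -}}old{{- end }}'
    new = '{{ define "__alert_title" -}}new{{- end }}'
    with tempfile.TemporaryDirectory() as tmp:
        template_dir = Path(tmp) / "fake" / "templates"
        # Initial load: nothing on disk yet, so a change is detected.
        first = needs_reload(template_dir, {"example.gotmpl": old})
        # The sidecar writes the new template directly into the directory.
        (template_dir / "example.gotmpl").write_text(new)
        # The next sync compares the new content with the already-updated file.
        second = needs_reload(template_dir, {"example.gotmpl": new})
    return first, second
```

In this model demo() reports a change on the first load but none after the sidecar write, which would match the symptom of the old template staying in use until a restart.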

locmai avatar Mar 08 '24 11:03 locmai

Need to take a look and see if we can reproduce this issue.

yeya24 avatar Mar 11 '24 17:03 yeya24

@yeya24 were you able to see this issue? We have a few users trying to make template changes and right now to remediate we have to restart the alertmanager pods. But we have multiple environments so that can be a bit burdensome especially if the template has an issue and needs to be rolled back etc.

dpericaxon avatar Jun 11 '24 23:06 dpericaxon

Hey @rapphil, @rajagopalanand, can you guys maybe help take a look at the issue?

yeya24 avatar Jun 11 '24 23:06 yeya24

Hey @rapphil, @rajagopalanand, did you get a chance to take a look at this? Since we have multiple environments, we currently have to do a rolling restart in every environment for every template change.

dpericaxon avatar Jun 25 '24 16:06 dpericaxon

Not yet. I will try and find some time to look this week

rajagopalanand avatar Jun 25 '24 16:06 rajagopalanand

I'm taking a look into this issue. Right now I'm trying to reproduce using the helm charts and a local cluster.

rapphil avatar Jul 11 '24 17:07 rapphil

Hi, I was not able to reproduce your issue.

Having said that, here are a couple of questions:

  • When you try to access the alertmanager endpoint /multitenant_alertmanager/configs, do you see a correct configuration? Is the configuration what you are expecting? This is what I'm getting when running my tests:
fake:
  template_files:
    template.gotmpl: |-
      {{ define "__alert_title" -}}
        {{ .CommonLabels.alertname }} - changeme
      {{- end }}
  alertmanager_config: |-
    route:
      group_wait: 30s
      group_interval: 10s
      receiver: slack-config
    receivers:
    - name: 'slack-config'
      slack_configs:
        - send_resolved: true
          api_url: 'http://echo-server.cortex'
          channel: "#channel1"
          title: '{{ template "__alert_title" . }}'
          text: 'Title :{{ template "__alert_title" . }}'
    templates:
    - 'template.gotmpl'

This is the payload that was passed to the echo server:

{"name":"echo-server","hostname":"echo-server-5fb75ccd64-bqkz5","pid":1,"level":30,"host":{"hostname":"echo-server.cortex","ip":"::ffff:10.244.0.116","ips":[]},"http":{"method":"POST","baseUrl":"","originalUrl":"/","protocol":"http"},"request":{"params":{},"query":{},"cookies":[],"body":{"channel":"#channel1","username":"Alertmanager","attachments":[{"title":"my_alert_confmap - changeme","title_link":"/api/prom/alertmanager/#/alerts?receiver=slack-config","text":"Title :my_alert_confmap - changeme","fallback":"[FIRING:1]  (my_alert_confmap my_instance my_cron_job warning) | /api/prom/alertmanager/#/alerts?receiver=slack-config","callback_id":"","footer":"","color":"danger","mrkdwn_in":["fallback","pretext","text"]}]},"headers":{"host":"echo-server.cortex","user-agent":"Alertmanager/","content-length":"439","content-type":"application/json"}},"msg":"Fri, 19 Jul 2024 20:30:58 GMT | [POST] - http://echo-server.cortex/","time":"2024-07-19T20:30:58.213Z","v":0}
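
A payload like the one above can be checked programmatically rather than by eye. The following is a small sketch, assuming the echo server writes one JSON object per log line with the Slack message under request.body as shown; rendered_titles and template_applied are hypothetical helper names, and SAMPLE_LINE is a trimmed copy of the payload.

```python
import json


def rendered_titles(log_line):
    """Extract the rendered Slack attachment titles from one echo-server log line."""
    entry = json.loads(log_line)
    attachments = entry["request"]["body"].get("attachments", [])
    return [attachment.get("title", "") for attachment in attachments]


def template_applied(log_line, marker):
    """Return True if any rendered title contains the expected template marker."""
    return any(marker in title for title in rendered_titles(log_line))


# Trimmed version of the payload above, keeping only the fields used here.
SAMPLE_LINE = json.dumps({
    "request": {
        "body": {
            "channel": "#channel1",
            "attachments": [{
                "title": "my_alert_confmap - changeme",
                "text": "Title :my_alert_confmap - changeme",
            }],
        }
    }
})
```

After editing the template, calling template_applied with the new marker text on a fresh log line indicates whether the updated template was used for rendering.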

Here is the full configmap that I used for alertmanager:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    k8s-sidecar-target-directory: /data/
  labels:
    cortex_alertmanager: "1"
  name: alertmanager-example-config
  namespace: cortex
data:
  fake.yaml: |-
    route:
      group_wait: 30s
      group_interval: 10s
      receiver: slack-config
    receivers:
    - name: 'slack-config'
      slack_configs:
        - send_resolved: true
          api_url: 'http://echo-server.cortex'
          channel: "#channel1"
          title: '{{ template "__alert_title" . }}'
          text: 'Title :{{ template "__alert_title" . }}'
    templates:
    - 'template.gotmpl'

And here is the configmap for the templates:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    k8s-sidecar-target-directory: /data/fake/templates/
  labels:
    cortex_alertmanager: "1"
  name: alertmanager-example-template
  namespace: cortex
data:
  template.gotmpl: |-
   {{ define "__alert_title" -}}
     {{ .CommonLabels.alertname }} - changeme
   {{- end }}

Also note that the sidecar is only functional if you use local storage.

rapphil avatar Jul 19 '24 22:07 rapphil

Hey @rapphil , thanks for taking a look at this issue.

Also notice that the side car is only functional if you use local storage.

Yes, we are using the local storage backend for alertmanager. Here is our alertmanager configuration. It could be quite outdated, since we have kept it unchanged from our initial setup until now:

alertmanager:
  external_url: /api/prom/alertmanager
  enable_api: true
  data_dir: /data/
alertmanager_storage:
  backend: local
  local:
    path: /data

When you try to access the alertmanager endpoint /multitenant_alertmanager/configs do you see a correct configuration? is the configuration what you are expecting?

I tested that previously and it returned the unchanged (non-updated) configuration. But your test is much simpler, so I will try to reproduce it the same way and post the result here.

locmai avatar Jul 20 '24 04:07 locmai

Hi, were you ever able to reproduce the results?

rapphil avatar Apr 24 '25 20:04 rapphil

Hi, sorry, I didn't have the chance to go back to this. Please feel free to close it, as it's not urgent for us, and we can re-open it once I have reproduction results.

Thank you for circling back on this too!

locmai avatar Apr 25 '25 09:04 locmai

@yeya24 do you think we can close this issue given the feedback?

rapphil avatar Apr 30 '25 22:04 rapphil