Add parseDuration function to templates

Open verdel opened this issue 1 year ago • 9 comments

I added a duration function. This would, for example, allow generating links to Grafana dashboards where a time range for display can be passed.

We will pass a time offset in the annotation and use it to generate the link in the following way:

Alert data:

"annotations": {
        "title": "Alert Summary text",
        "description": "Detailed alert description",
        "log_url": "https://grafana.example.com ...from=%from%, to=%to%",
        "grafana_dashboard_range": "-2m"
      },

Alert template:

{{ define "test" }}
{{ .Annotations.message }}
{{- "\\n" -}}
{{ (index .Alerts 0).Annotations.log_url | reReplaceAll "%from%" (printf "%d000" ((index .Alerts 0).StartsAt.Add (duration (index .Alerts 0).Annotations.grafana_dashboard_range)).Unix) | reReplaceAll "%to%" (printf "%d000" (index .Alerts 0).StartsAt.Unix)}}
{{ end }}

The function itself is very simple:

// duration returns the time.Duration representation of a passed string
"duration": func(text string) (time.Duration, error) {
    d, err := time.ParseDuration(text)
    if err != nil {
        return 0, err
    }
    return d, nil
}

verdel avatar Apr 22 '24 10:04 verdel

I have one question about the intended use case. Wouldn't it be better for the log_url annotation to contain the correct URL before the alert is sent to Alertmanager? Then the URL would also work in visualization tools like Grafana and karma.

For example, instead of this:

"annotations": {
        "title": "Alert Summary text",
        "description": "Detailed alert description",
        "log_url": "https://grafana.example.com ...from=%from%, to=%to%",
        "grafana_dashboard_range": "-2m"
      },

Have this:

"annotations": {
        "title": "Alert Summary text",
        "description": "Detailed alert description",
        "log_url": "https://grafana.example.com ...from=now, to=-2h"
      },

grobinson-grafana avatar Apr 22 '24 15:04 grobinson-grafana

We use Alertmanager not only with Prometheus but also with Loki, and in LogQL, there are no functions to get the current timestamp. There is an open issue describing such a case.

Therefore, we cannot form values for from and to on the Loki Ruler side. Currently, we generate a link to the grafana dashboard in a way similar to what I described in the example. However, since there is no function to work with duration in Alertmanager, we have to set the argument for .StartsAt.Add statically.

If we add a duration function, we will be able to pass the range through alert annotation.

verdel avatar Apr 22 '24 15:04 verdel

OK! 👍

The reason I asked is I was looking at this example:

{{ (index .Alerts 0).Annotations.log_url | reReplaceAll "%from%" (printf "%d000" ((index .Alerts 0).StartsAt.Add (duration (index .Alerts 0).Annotations.grafana_dashboard_range)).Unix) | reReplaceAll "%to%" (printf "%d000" (index .Alerts 0).StartsAt.Unix)}}

and made the following observations:

  1. This is a complicated template, and I suspect there are other users who want to solve the same problem. I think this is a neat "hack" but is there a better way to solve this?
  2. If the grafana_dashboard_range annotation is missing for at least one alert then the entire template will fail and a notification will not be sent. You need to remember to add if statements or add a safe function like duration_or_default.
  3. Most requests I've seen for a duration function so far have been to take a time and print the time since (in other words – the opposite behavior). For example:
{{ .StartsAt|duration }}

Would print:

3 hours, 2 minutes and 1 seconds ago

grobinson-grafana avatar Apr 22 '24 15:04 grobinson-grafana

> This is a complicated template, and I suspect there are other users who want to solve the same problem. I think this is a neat "hack" but is there a better way to solve this?

I couldn't find another solution in the case with Loki that would allow getting the timestamp of the moment when the alert-triggering logic is activated, as well as a way to perform operations with this timestamp (adding or subtracting time intervals to get a range).

The only "hack" available relies on the fact that .StartsAt is of type time.Time, so its methods can be called within the template. However, time arithmetic usually requires a value of type time.Duration, and all the annotation data that reaches the template for processing is of type string.

> If the grafana_dashboard_range annotation is missing for at least one alert then the entire template will fail and a notification will not be sent. You need to remember to add if statements or add a safe function like duration_or_default.

This is a simplified version of the template, and the responsibility for preparing data in custom templates rests with the end user.

Most often, we use this approach in the template for creating buttons in messages sent to Slack. If there is an error in forming the URL, the button simply will not be displayed in the message.

> Most requests I've seen for a duration function so far have been to take a time and print the time since (in other words – the opposite behavior)

I can change the name of the function to eliminate ambiguous interpretation of its functionality. For example, name it parseDuration.

verdel avatar Apr 22 '24 16:04 verdel

I just want to make sure there is no misunderstanding here either. I'm not against adding a function that parses time durations so that operations can be done on StartsAt and EndsAt. I think we can call it something else, like parseDuration, as you suggested at the end of your comment.

However, I also think Prometheus, Mimir and Loki should make the StartsAt time and related functions available in annotation templates. This would allow you to template the annotation outside of the Alertmanager which is the correct way to do it for the following reasons:

  1. The correct link is shown when the alert is viewed in Alertmanager UI, Grafana, and other UIs like karma.
  2. I think your solution to this problem is clever, but I want to spare other users from having to come up with the same "hack". The reason I consider it a "hack" is that you are basically using the annotation as a template to be templated again in the Alertmanager using regular expressions. This is what I want to avoid by making sure users have the necessary features in the right places.
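
For readers unfamiliar with the pattern described in the second point, the sketch below shows the second templating pass in isolation: placeholders left in the annotation are substituted by regular expression. The URL is a made-up example (the real one in this thread is elided), and `reReplaceAll` here is a local helper written to match the argument order used in the template, not a copy of Alertmanager's code.

```go
package main

import (
	"fmt"
	"regexp"
)

// reReplaceAll takes pattern, replacement, then the text; in a template
// pipeline the text arrives last via the pipe.
func reReplaceAll(pattern, repl, text string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(text, repl)
}

func main() {
	// Hypothetical annotation value containing the two placeholders.
	annotation := "https://example.com/d/abc?from=%from%&to=%to%"

	url := reReplaceAll("%from%", "1699999880000", annotation)
	url = reReplaceAll("%to%", "1700000000000", url)
	fmt.Println(url)
}
```

Only the notification sees the substituted URL; any UI that reads the raw annotation still shows the placeholders, which is the drawback raised in the first point.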

I appreciate that you cannot do this right now because those features are missing, and that this is why you're having to do it in the notification template instead.

I think we should create an issue in prometheus/prometheus for StartsAt and related time functions to be made available in annotation templates.

@gotjosh what do you think?

grobinson-grafana avatar Apr 22 '24 16:04 grobinson-grafana

I agree that splitting the logic of forming alert messages and moving part of it to Alertmanager is not the best idea.

I would prefer to have this functionality available on the Prometheus or Loki side. Therefore, I will definitely support the idea of creating an issue in the repositories of these products.

Until this functionality is available on the Prometheus or Loki side, I would still ask that my custom function be accepted into the Alertmanager code that handles template processing. I have renamed the function to parseDuration.

verdel avatar Apr 22 '24 16:04 verdel

Yeah I asked @gotjosh for a second opinion as he is an official maintainer and I'm not 🙂

grobinson-grafana avatar Apr 22 '24 16:04 grobinson-grafana

@gotjosh, @grobinson-grafana, can I help with the review process of this PR in any way?

verdel avatar Jul 02 '24 15:07 verdel