alertmanager
Add basic arithmetic functions to templating funcmap
Why
I link to grafana dashboards from my prometheus alerts, using some alert labels as grafana variables in the URL to narrow down the dashboard queries. I would like to link to a specific range (eg: 30 mins before/after) around StartsAt and/or EndsAt instead of having to adjust the timeline after opening the link, but this requires basic arithmetic functions (ie: add/sub). In addition, grafana URL params expect unix timestamps in milliseconds, making StartsAt and EndsAt currently unusable, forcing me to return to the alert and manually translate the times from it to grafana via a date picker.
Proposal
Add the following self-explanatory arithmetic functions to DefaultFuncs in https://github.com/prometheus/alertmanager/blob/master/template/template.go:
- add
- sub
- div
- mul
From a user perspective, I don't want to worry about types here, so these functions will require type assertions to differentiate between floats and ints (and maybe strconv.Atoi if string, but I'm not sold on the usefulness of that). Still, this should be relatively simple to implement.
Add the following two functions to allow the use of the above arithmetic functions to manipulate the timestamps in StartsAt and EndsAt:
- toUnix
- fromUnix
This is similar to #603 but that request has other concerns about manipulating/accessing current dates, sorting lists etc, so I thought it worth separating the two.
Anything like this should be done down in alert templates in Prometheus.
The problem with putting this in Prometheus alert templates is that it will cause a huge amount of duplication in this case.
To solve my given example with my proposal I would need to add something like the following to a single shared template in alertmanager:
{{ $url := (printf "%s?from=%s" $url (.StartsAt | toUnix | sub 1800000)) }}
{{ $url := (printf "%s&to=%s" $url (.EndsAt | toUnix | add 1800000)) }}
That I can then include wherever needed with {{ template "grafana.link.href.partial" . }}.
To achieve the same if these functions instead existed in Prometheus alert templates would require adding the following to every single rule:
annotations:
  start_unix: {{ .StartsAt | toUnix | sub 1800000 }}
  end_unix: {{ .EndsAt | toUnix | add 1800000 }}
As well as having a line in the alertmanager template to extract the annotation anyway.
Additionally, timestamp data seemingly isn't exposed to Prometheus alert templates. Unless I'm missing something, only labels and the raw sample value are exposed: https://github.com/prometheus/prometheus/blob/master/rules/alerting.go#L196-L202
Even if this timestamp data were to be exposed to the template here it wouldn't be the ActiveAt timestamp (same as StartsAt in alertmanager?), which is the useful one for this example.
The StartsAt and EndsAt aren't exactly reliable, and may be zero depending on the current state of the alert. They're more an implementation detail than anything.
Usually you also want context on an alert, not merely when it gets bad enough to start firing. What I'd suggest is creating links to Grafana with fixed parameters such as &from=now-6h&to=now or relying on the defaults for the dashboard, which (presumably) have an appropriate value for the time range already.
Ugly workaround for now:
groups:
- name: testalert
  rules:
  - record: grafanaFrom
    expr: vector((time() - (30*60))*1000)
  - record: grafanaTo
    expr: vector((time() + (30*60))*1000)
  - alert: IgnoreAlert
    expr: vector(1)
    for: 10s
    labels:
      severity: major
      grafana: "http://grafana.board.local?{{ printf \"from=%.0f&to=%.0f\" (query \"grafanaFrom\" | first | value) (query \"grafanaTo\" | first | value) }}"
    annotations:
      summary: Daily alert test summary
      description: Daily alert test description
Note that the grafana link should be an annotation and not a label (see https://github.com/prometheus/prometheus/issues/4652 for the details).
I would argue that being able to produce a graph attached to an alert with the timeframe the alert occurred as opposed to (now-6h to now) would be ideal for gathering data and graphs to prepare for postmortems. It seems like it would be very beneficial.
Ideally, one would do something like this in the alert template:
https://grafana.url:xxx/dashboard?var-pod_name={{ .Labels.pod_name }}&from={{ .StartsAt | UnixDate }}-15m&to={{ .EndsAt | UnixDate }}
Any updates? I tried to put Splunk and Grafana links with timestamps into a Splunk alert template. I still haven't found a good solution.
IMHO, using relative links like now-6h to now is bad practice. Sometimes you'd like to use the link later, for example after the weekend.
As of now, the closest solution is to use:
{{ with query "time()" }}{{ . | first | value | printf "%.0f"}}{{ end }}
Do you have any idea how to trim the whitespace at the beginning and end of the timestamp?
I found this thread while searching for grafana timerange but in alertmanager templates. This is my solution, maybe it helps someone else:
&time={{- (index .Alerts 0).StartsAt.Unix -}}000&time.window=600000
Guys I needed StartsAt - 10m
This
{{ (.StartsAt.Add -600000000000 ).Unix }}000
Did the trick for me with Grafana.
Logs: <{{ $.ExternalURL }}/explore?orgId=1&left=%5B%22{{ (.StartsAt.Add -600000000000 ).Unix }}000%22,%22{{if eq .Status "firing" }}now{{ else }}{{ .EndsAt.Unix }}000{{ end }}%22,%22Loki%22,%7B%22expr%22:%22{{ urlquery .Annotations.logsExpr | reReplaceAll "\+" "%20" | reReplaceAll "%5C" "%5C%5C" | reReplaceAll "%22" "%5C%22" }}%22%7D%5D|:chart_with_upwards_trend: Graph>
I also use {{ (.StartsAt.Add -600000000000).Unix }}000.
I think we can close this issue.
I think it makes sense to document the workaround in the official documentation as it isn't obvious for most people.
I have spent many hours finding this issue and workarounds. So I think definitely that official docs should be updated with examples and tips. Linking to the Grafana dashboard with time range is crucial. But what would be even better is to have variables and functions to support this functionality.
Thanks to all contributing with useful workarounds!
There's still no basic math available, though.
There are no integer or decimal fields in the template data as far as I can tell, so in what situations would having Math functions be useful? (template.go#L296-L317)
@grobinson-grafana there is {{ $value }}, take for example kube_job_status_start_time and you could use that value (unix ts) to generate a link to logs with sensible timestamp bounds
@nikita2206 There is $value in Prometheus. However, this issue is talking about Alertmanager, and there is no $value in Alertmanager as far as I know?
@grobinson-grafana To be more specific, here is my use case: (including the workaround)
- alert: KubeCronJobFailing2Hours
  expr: |
    (kube_job_failed{condition="true"} > 0)
    * on (job_name) group_right ()
      label_replace(kube_job_owner{owner_kind="CronJob"}, "cronjob", "$0", "owner_name", ".*")
    * on (job_name) group_left ()
      kube_job_status_start_time
    unless on (cronjob)
      label_replace(
        present_over_time(kube_job_status_completion_time[2h]),
        "cronjob", "$1", "job_name", "^(.+)-\\d+$")
  annotations:
    type: Job
    cronjob: "{{ $labels.cronjob }}"
    message: >
      CronJob `{{ $labels.cronjob }}` is failing and hasn't completed successfully for at least 2 hours,
      last attempt was at {{ $value | humanizeTimestamp }},
      <https://logs-backend.internal/logs?filter=trace-id%3D%27{{ $labels.job_name }}%27&startTime={{ (printf "vector(%f - 10)" .Value) | query | first | value | printf "%.0f" }}&endTime={{ (printf "vector(%f + 1800)" .Value) | query | first | value | printf "%.0f" }}|logs here>.
As you can see, I would like to include a link to the logs, which needs time bounds. Sensible time bounds, given that the start timestamp of the Job is known, would be something like '10 seconds before the job started' until '30 minutes after the job started'.