terraform-provider-grafana
Easy to comprehend template for "Alert rule group"
Hello Team,
It would be great to have easy-to-comprehend templates for creating alert rules, something similar to the YAML contents at https://registry.terraform.io/providers/inuits/cortex/latest/docs/resources/rules. Or maybe something like this:
- name: myalert rule
  folder:
  group_name:
  query 1:
    expr:
    datasource:
    range:
  query 2:
    expr:
    datasource:
    range:
  condition:
    range:
    evaluator:
  annotations:
    description:
    summary:
  labels:
    key: value
That would reduce the overhead of understanding what the alert rule is from the manifest itself.
This is a wider effort that we are tracking internally, and it has existed for some time. This isn't purely a Terraform thing - the same goes for the .yaml provisioning and the API itself.
Really, it's just that Grafana's representation of an alert rule is really large. This is the result of a trade-off: it grows in size because of its flexibility, as it can query any arbitrary datasource.
One single model has difficulty covering all cases, as not every datasource is built around a query string. Consider the CloudWatch/Stackdriver datasources: there isn't a single query field, but rather the result of a number of drop-downs.
What we are looking into is how we can have very simple, targeted rule definitions, but specific to some common datasource types. Users who need the flexibility can then fall back to the generic struct we have now. But, this effort spans a few different systems including Terraform, so it's not quite there yet.
We face the same issue as we are starting to use Terraform to manage our alerts now; we have multiple datasources (GCP/AWS/BigQuery/Prometheus) with more than 150 alerts.
Coding the alerts directly with the grafana_rule_group resource was impossible (nearly 200 lines per alert), so we have created multiple modules to simplify alert creation. We have one module per model of datasource query, one module for the Grafana expression, and one module per datasource (which aggregates the other modules).
It was difficult to write and there is a lot of complexity (because of the multiple modules), but at least the usage is simple and reduced to the strict minimum.
Maybe the first thing to implement to help Grafana users with Terraform is to provide the model for the query of each datasource ... it is actually a pain to discover the model and understand it, because nothing is documented in Grafana ...
@Eraac Could you please share an example of how to use grafana_rule_group with the CloudWatch datasource? Thanks
Sure @obounaim, the model for the CloudWatch query looks like this; it can be used inside the model attribute https://registry.terraform.io/providers/grafana/grafana/latest/docs/resources/rule_group#model
HCL
locals {
  model = {
    refId         = var.ref_id
    intervalMs    = coalesce(var.interval_milliseconds, 1000)
    maxDataPoints = coalesce(var.max_data_points, 43200)
    alias         = var.alias
    dimensions    = var.dimensions
    expression    = var.expression
    id            = var.id
    matchExact    = coalesce(var.match_exact, true)
    metricName    = var.metric_name
    namespace     = var.namespace
    period        = var.period
    region        = var.region
    statistic     = coalesce(var.statistic, "Average")
    logGroupNames = var.log_group_names

    metricEditorMode = var.metric_editor_mode
    metricQueryType  = var.metric_query_type
    queryMode        = coalesce(var.query_mode, "Metrics")
    sql              = var.sql
    sqlExpression    = var.sql_expression
    # statsGroups is omitted: its usage is unclear, but can be figured out from the query editor interface
  }
}
variable.tf
variable "ref_id" {
description = "Reference name for the query"
type = string
default = "A"
}
variable "interval_milliseconds" {
  description = "Number of milliseconds between each point of the time series. Use period instead; this attribute is only here because Grafana sets it"
  type        = number
  default     = null
}
variable "max_data_points" {
  description = "Maximum number of points for the time series"
  type        = number
  default     = null
}
variable "alias" {
# https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/graph-dynamic-labels.html
description = "Change time series legend name using Dynamic labels. See documentation for details"
type = string
default = null
}
variable "dimensions" {
description = "A dimension is a name/value pair that is part of the identity of a metric"
type = map(string)
default = {}
}
variable "expression" {
# https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-search-expressions.html
description = "Search expressions are a type of math expression that you can add to CloudWatch graphs. Used by search metrics or logs"
type = string
default = null
}
variable "id" {
description = "ID can be used to reference other queries in math expressions"
type = string
default = null
}
variable "match_exact" {
  description = "Only show metrics that exactly match all defined dimension names"
  type        = bool
  default     = null
}
variable "metric_name" {
# https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/viewing_metrics_with_cloudwatch.html
description = "Name of the metric to retrieve"
type = string
default = null
}
variable "namespace" {
description = "A namespace is a container for CloudWatch metrics. Metrics in different namespaces are isolated from each other, so that metrics from different applications are not mistakenly aggregated into the same statistics"
type = string
default = null
}
variable "period" {
description = "Minimal interval between two points in seconds"
type = string
default = null
}
variable "region" {
description = "Region to call for CloudWatch"
type = string
default = null
}
variable "statistic" {
# https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Statistic
# https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Statistics-definitions.html
description = "Statistics are metric data aggregations over specified periods of time"
type = string
# Average, Sum, Minimum, Maximum, SampleCount
default = null
}
variable "metric_editor_mode" {
description = "Determine the editor mode (builder or code)"
type = number
default = null # values: 0 -> builder | 1 -> code
}
variable "sql_expression" {
description = "Raw SQL expression to pass to CloudWatch to retrieve the timeseries. Don't forget to set 'metric_editor_mode' to 1"
type = string
default = null
}
variable "metric_query_type" {
# https://grafana.com/docs/grafana/latest/datasources/aws-cloudwatch/#metrics-query-editor
description = "The type of query to build"
type = number
# Metrics Query in the CloudWatch plugin is what is referred to as Metric Insights in the AWS console
default = null # values: 0 -> metric search | 1 -> metric query
}
variable "query_mode" {
  description = "Determine whether we query CloudWatch metrics or CloudWatch logs"
  type        = string
  default     = null # values: "Metrics", "Logs"
}
variable "log_group_names" {
  description = "Names of the log groups to read from"
  type        = list(string)
  default     = null
}
variable "sql" {
description = "Same as sql_expression, but for the builder. Use sql_expression instead"
type = object({}) # structure is too difficult
default = null
}
output.tf
output "model" {
value = jsonencode(local.model)
}
output "ref_id" {
value = var.ref_id
}
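For readers following along, here is a minimal sketch of wiring such a module's output into an alert rule. The module source path, datasource UID, and all values are hypothetical; the rule and data blocks follow the grafana_rule_group resource schema from the registry docs.
HCL
module "cloudwatch_query" {
  source      = "./modules/cloudwatch-query" # hypothetical path to the module above
  ref_id      = "A"
  namespace   = "AWS/EC2"
  metric_name = "CPUUtilization"
  region      = "eu-west-1"
  statistic   = "Average"
  period      = "300"
}

resource "grafana_rule_group" "example" {
  name             = "cpu-alerts"
  folder_uid       = grafana_folder.rule_folder.uid
  interval_seconds = 60

  rule {
    name      = "High CPU"
    condition = "B" # ref_id of a separate expression query, not shown here

    data {
      ref_id         = module.cloudwatch_query.ref_id
      datasource_uid = "my-cloudwatch-uid" # hypothetical datasource UID
      model          = module.cloudwatch_query.model

      relative_time_range {
        from = 600
        to   = 0
      }
    }
    # a second data block carrying an expression model would provide the condition
  }
}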
Thanks @Eraac. It seems to be working; however, it seems that a condition of type "expression" is missing. I tried to find the JSON syntax for it, but I was not able to in the Grafana documentation.
@obounaim indeed, here is the module we have made for handling the expression model
HCL
locals {
  model = {
    type          = coalesce(var.type, "classic_conditions")
    refId         = var.ref_id
    intervalMs    = coalesce(var.interval_milliseconds, 1000)
    maxDataPoints = coalesce(var.max_data_points, 43200)

    # math, reduce, resample
    expression = var.expression

    # reduce
    reducer = var.reducer
    settings = var.type != "reduce" ? null : {
      # for strict mode, the mode is an empty string
      mode = coalesce(var.reduce_mode, "strict") == "strict" ? "" : var.reduce_mode
    }

    # resample
    downsampler = var.down_sampler
    upsampler   = var.up_sampler
    window      = var.window

    # classic_conditions
    conditions = [
      for v in var.conditions : {
        evaluator = {
          params = v.evaluator_params
          type   = v.evaluator_type
        }
        operator = {
          type = v.operator_type
        }
        query = {
          params = [v.query_ref_id_target]
        }
        reducer = {
          type = v.reducer_type
        }
      }
    ]
  }
}
variable.tf
variable "type" {
description = "Type of the query (classic_conditions, math, reduce, resample)"
type = string
default = "classic_conditions" # classic_conditions, math, reduce, resample
}
variable "ref_id" {
description = "Name of the query"
type = string
default = "Z"
}
variable "interval_milliseconds" {
  description = "Number of milliseconds between each point of the time series; this attribute is only here because Grafana sets it"
  type        = number
  default     = null
}
variable "max_data_points" {
  description = "Maximum number of points for the time series"
  type        = number
  default     = null
}
variable "expression" {
  description = "Must be the ref_id of the input for reduce and resample; for math, it is the formula"
  type        = string
  default     = null
}
variable "reducer" {
description = "The function to apply on time series to reduce it (mean, min, max, sum, count, last)"
type = string
default = null # values: "mean", "min", "max", "sum", "count", "last"
}
variable "reduce_mode" {
  description = "strict: Result can be NaN if series contains non-numeric data | dropNN: Drop NaN, +/-Inf and null from input series before reducing | replaceNN: Replace NaN, +/-Inf and null with a constant before reducing (variable 'reduce_replace_with')"
  type        = string
  default     = null # values: "dropNN", "replaceNN", "strict"
}
variable "reduce_replace_with" {
description = "When reduce_mode is 'replaceNN', use this value to replace all the NaN, +/-Inf and null values"
type = number
default = null
}
variable "down_sampler" {
  description = "The reduction function to use when there is more than one data point per window sample (min, max, mean, sum)"
  type        = string
  default     = null # values: min, max, mean, sum
}
variable "up_sampler" {
  description = "The method used to fill a window sample that has no data points. pad: fill with the last known value | backfilling: fill with the next known value | fillna: fill empty sample windows with NaNs"
  type        = string
  default     = null # values: pad, backfilling, fillna
}
variable "window" {
  description = "The duration of time to resample to, for example 10s. Units may be s for seconds, m for minutes, h for hours, d for days, w for weeks, and y for years"
  type        = string
  default     = null
}
variable "conditions" {
  description = "List of conditions to fire the alert"
  type = list(object({
    evaluator_params    = list(number),           # 1 param for lt/gt, 2 params for outside_range/within_range, 0 for no_value
    evaluator_type      = string,                 # gt, lt, outside_range, within_range, no_value
    operator_type       = optional(string, "and"), # for multiple conditions
    query_ref_id_target = optional(string, "A"),
    reducer_type        = string, # sum, min, max, count, last, median, avg, count_non_null, diff, diff_abs, percent_diff, percent_diff_abs
  }))
  default = [] # an empty list (rather than null) keeps the for expression in the model valid
}
output.tf
output "model" {
value = jsonencode(local.model)
}
output "ref_id" {
value = var.ref_id
}
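To tie the two modules together, here is a hedged sketch of using this expression module as the alert condition. The module path is hypothetical; "__expr__" is the reserved expression datasource UID in recent Grafana versions (older versions use "-100").
HCL
module "threshold" {
  source = "./modules/expression" # hypothetical path to the module above
  ref_id = "B"

  conditions = [{
    evaluator_params    = [80]
    evaluator_type      = "gt"
    operator_type       = "and"
    query_ref_id_target = "A"
    reducer_type        = "last"
  }]
}

# inside a rule block, alongside the datasource query's data block:
# condition = module.threshold.ref_id
# data {
#   ref_id         = module.threshold.ref_id
#   datasource_uid = "__expr__"
#   model          = module.threshold.model
#   relative_time_range {
#     from = 0
#     to   = 0
#   }
# }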
Thanks @Eraac, it works great. One more question that is maybe out of the scope of this issue.
Is there a way to create the "rule" argument in the "grafana_rule_group" resource automatically, using a loop like for_each? I am aware the for_each meta-argument applies to resources; are you aware of something similar that can be used for an argument?
example:
resource "grafana_rule_group" "my_alert_rule" {
  name             = "My Rule Group"
  folder_uid       = grafana_folder.rule_folder.uid
  interval_seconds = 240
  org_id           = 1
  for_each         = toset(["rule1", "rule2", "rule3", "rule4"])

  rule {
    name = each.key
  }
}
@obounaim https://developer.hashicorp.com/terraform/language/expressions/dynamic-blocks
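For concreteness, the linked dynamic blocks feature generates the repeated rule blocks from a collection; for_each goes on the dynamic block rather than on the resource. A sketch only: each rule would still need its condition and data blocks, omitted here.
HCL
resource "grafana_rule_group" "my_alert_rule" {
  name             = "My Rule Group"
  folder_uid       = grafana_folder.rule_folder.uid
  interval_seconds = 240
  org_id           = 1

  dynamic "rule" {
    for_each = toset(["rule1", "rule2", "rule3", "rule4"])
    content {
      name = rule.value
      # condition and data blocks go here as usual
    }
  }
}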
I understand the trade-offs mentioned by @alexweav. It is really hard to maintain Terraform-native definitions for every single supported data source, which, in the meantime, may change according to their own evolution pace.
In practice, though, you do not need tons of supported data sources; you use just a few. It seems entirely possible to create a parameters.tf file with a summary of what makes every alert rule unique, while templating into the proper format (including JSON) at the final stage. The structure can be as @sammit20 suggested, or simpler or more complex depending on your needs. You write this thing once and then reuse it for every alert rule. It is impossible, though, to create an ideal solution for everyone; every user must do it on their own based on what makes sense for them.
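As an illustration of that approach, here is a hypothetical parameters map expanded into rule blocks. All names, metrics, and thresholds are made up; the model payloads would come from modules like the ones shared earlier in this thread.
HCL
locals {
  # one short entry per alert; everything else is templated
  alerts = {
    "high-cpu" = { metric = "CPUUtilization", threshold = 80 }
    "low-mem"  = { metric = "MemoryUtilization", threshold = 90 }
  }
}

resource "grafana_rule_group" "generated" {
  name             = "generated-alerts"
  folder_uid       = grafana_folder.rule_folder.uid
  interval_seconds = 60

  dynamic "rule" {
    for_each = local.alerts
    content {
      name      = rule.key
      condition = "B"
      # data blocks built from rule.value.metric and rule.value.threshold
      # via the query and expression modules described above
    }
  }
}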