
Easy to comprehend template for "Alert rule group"

Open sammit20 opened this issue 3 years ago • 9 comments

Hello Team,

It would be great to have easy-to-comprehend templates for creating alert rules, something similar to yaml contents https://registry.terraform.io/providers/inuits/cortex/latest/docs/resources/rules. Or maybe something like this:

  name: my alert rule
  folder:
  group_name:
  query 1:
    expr:
    datasource:
    range:
  query 2:
    expr:
    datasource:
    range:
  condition:
    range:
    evaluator:
  annotations:
    description:
    summary:
  labels:
    key: value

That would reduce the overhead of understanding what the alert rule is about from the manifest itself.

sammit20 avatar Sep 26 '22 11:09 sammit20

This is a wider effort that we are tracking internally, and it has existed for some time. It isn't purely a Terraform thing - the same goes for the .yaml provisioning and the API itself.

Really, it's just that Grafana's representation of an alert rule is very large. This is the result of a trade-off: the model grows in size because of its flexibility, since it can query any arbitrary datasource.

A single model has difficulty covering all cases, as not every datasource is built around a query string. Consider the CloudWatch/Stackdriver datasources: there isn't a single query field, but rather the result of a number of drop-downs.

What we are looking into is how we can have very simple, targeted rule definitions, specific to some common datasource types. Users who need the flexibility can then fall back to the generic struct we have now. But this effort spans a few different systems, including Terraform, so it's not quite there yet.

alexweav avatar Sep 30 '22 21:09 alexweav

We face the same issue as we are starting to use Terraform to manage our alerts now: we have multiple datasources (GCP/AWS/BigQuery/Prometheus) with more than 150 alerts.

Coding the alerts directly with the grafana_rule_group resource was impossible (nearly 200 lines per alert), so we have created multiple modules to simplify alert creation. We have one module per model of datasource query + one module for the Grafana expression + one module per datasource (which aggregates the other modules); see the sketch below.

It was difficult to write and there is a lot of complexity (because of the multiple modules), but at least the usage is simple and reduced to the strict minimum.

Maybe the first thing to implement to help Grafana users with Terraform is to provide the query model for each datasource ... it is actually a pain to discover the model and understand it, because nothing is documented in Grafana ...
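
As a rough sketch of what that layering looks like from the caller's side (the module name, path, and values here are purely illustrative, not our real ones), each alert ends up being a single short module call:

# One short block per alert instead of ~200 lines of raw grafana_rule_group config.
module "high_cpu_alert" {
  source = "./modules/cloudwatch-alert" # hypothetical aggregating module

  name        = "EC2 high CPU"
  namespace   = "AWS/EC2"
  metric_name = "CPUUtilization"
  region      = "eu-west-1"
  statistic   = "Average"
  threshold   = 80
}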

eraac avatar Oct 14 '22 09:10 eraac

@Eraac Could you please share an example of how to use grafana_rule_group with the CloudWatch datasource? Thanks

obounaim avatar Oct 25 '22 15:10 obounaim

Sure @obounaim, the model for the CloudWatch query looks like this; it can be used inside the model attribute https://registry.terraform.io/providers/grafana/grafana/latest/docs/resources/rule_group#model

main.tf
locals {
  model = {
    refId = var.ref_id,

    intervalMs    = coalesce(var.interval_milliseconds, 1000)
    maxDataPoints = coalesce(var.max_data_points, 43200)

    alias            = var.alias,
    dimensions       = var.dimensions,
    expression       = var.expression,
    id               = var.id,
    matchExact       = coalesce(var.match_exact, true),
    metricName       = var.metric_name,
    namespace        = var.namespace,
    period           = var.period,
    region           = var.region,
    statistic        = coalesce(var.statistic, "Average"),
    logGroupNames    = var.log_group_names,
    metricEditorMode = var.metric_editor_mode,
    metricQueryType  = var.metric_query_type,
    queryMode        = coalesce(var.query_mode, "Metrics"),
    sql              = var.sql,
    sqlExpression    = var.sql_expression,
    # statsGroups :shrug: -> can figure out the usage from the interface
  }
}
variable.tf
variable "ref_id" {
  description = "Reference name for the query"
  type        = string

  default = "A"
}

variable "interval_milliseconds" {
  description = "Number of milliseconds between each points of the timeserie. Use period instead, this attribute is only here because Grafana set it"
  type        = number

  default = null
}

variable "max_data_points" {
  description = "Maximun number of points for the timeseries"
  type        = number

  default = null
}

variable "alias" {
  # https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/graph-dynamic-labels.html
  description = "Change time series legend name using Dynamic labels. See documentation for details"
  type        = string

  default = null
}

variable "dimensions" {
  description = "A dimension is a name/value pair that is part of the identity of a metric"
  type        = map(string)

  default = {}
}

variable "expression" {
  # https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-search-expressions.html
  description = "Search expressions are a type of math expression that you can add to CloudWatch graphs. Used by search metrics or logs"
  type        = string

  default = null
}

variable "id" {
  description = "ID can be used to reference other queries in math expressions"
  type        = string

  default = null
}

variable "match_exact" {
  description = "Only show metrics that exactly match all defined dimensions names"
  type        = bool

  default = null
}

variable "metric_name" {
  # https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/viewing_metrics_with_cloudwatch.html
  description = "Name of the metric to retrieve"
  type        = string

  default = null
}

variable "namespace" {
  description = "A namespace is a container for CloudWatch metrics. Metrics in different namespaces are isolated from each other, so that metrics from different applications are not mistakenly aggregated into the same statistics"
  type        = string

  default = null
}

variable "period" {
  description = "Minimal interval between two points in seconds"
  type        = string

  default = null
}

variable "region" {
  description = "Region to call for CloudWatch"
  type        = string

  default = null
}

variable "statistic" {
  # https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Statistic
  # https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Statistics-definitions.html
  description = "Statistics are metric data aggregations over specified periods of time"
  type        = string

  # Average, Sum, Minimum, Maximum, SampleCount
  default = null
}

variable "metric_editor_mode" {
  description = "Determine the editor mode (builder or code)"
  type        = number

  default = null # values: 0 -> builder | 1 -> code
}

variable "sql_expression" {
  description = "Raw SQL expression to pass to CloudWatch to retrieve the timeseries. Don't forget to set 'metric_editor_mode' to 1"
  type        = string

  default = null
}

variable "metric_query_type" {
  # https://grafana.com/docs/grafana/latest/datasources/aws-cloudwatch/#metrics-query-editor
  description = "The type of query to build"
  type        = number

  # Metrics Query in the CloudWatch plugin is what is referred to as Metric Insights in the AWS console
  default = null # values: 0 -> metric search | 1 -> metric query
}

variable "query_mode" {
  description = "Determine if we query cloudwatch metrics or cloudwatch logs"
  type        = string

  default = null # values: "Metrics", "Logs"
}

variable "log_group_names" {
  description = "Name of the logs group to read from"
  type        = list(string)

  default = null
}

variable "sql" {
  description = "Same as sql_expression, but for the builder. Use sql_expression instead"
  type        = object({}) # structure is too difficult

  default = null
}
output.tf
output "model" {
  value = jsonencode(local.model)
}

output "ref_id" {
  value = var.ref_id
}
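
For illustration, assuming this module lives at ./modules/cloudwatch-query (the path, datasource values, and queue name below are made up), it can be called like this, with the model output ready to drop into a rule's data block:

module "queue_depth_query" {
  source = "./modules/cloudwatch-query" # hypothetical path to the module above

  namespace   = "AWS/SQS"
  metric_name = "ApproximateNumberOfMessagesVisible"
  dimensions  = { QueueName = "my-queue" }
  region      = "eu-west-1"
  statistic   = "Sum"
  period      = "300"
}

# module.queue_depth_query.model is a JSON string suitable for the rule's
# data { model = ... } attribute shown in the rule_group documentation.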

eraac avatar Oct 25 '22 21:10 eraac

Thanks @Eraac, it seems to be working. However, it seems that the condition of type "expression" is missing. I tried to find the JSON syntax for it, but I was not able to find it in the Grafana documentation.

obounaim avatar Oct 26 '22 10:10 obounaim

@obounaim indeed, here is the module we have made for handling the expression model

main.tf
locals {
  model = {
    type  = coalesce(var.type, "classic_conditions"),
    refId = var.ref_id,

    intervalMs    = coalesce(var.interval_milliseconds, 1000)
    maxDataPoints = coalesce(var.max_data_points, 43200)

    # math, reduce, resample
    expression = var.expression,

    # reduce
    reducer = var.reducer,
    settings = var.type != "reduce" ? null : {
      # for strict mode, the mode is empty string
      mode = coalesce(var.reduce_mode, "strict") == "strict" ? "" : var.reduce_mode,
    }

    # resample
    downsampler = var.down_sampler
    upsampler   = var.up_sampler
    window      = var.window

    # classic_conditions
    conditions = [
      for v in var.conditions : {
        evaluator = {
          params = v.evaluator_params,
          type   = v.evaluator_type,
        },

        operator = {
          type = v.operator_type,
        },

        query = {
          params = [v.query_ref_id_target],
        }

        reducer = {
          type = v.reducer_type,
        }
      }
    ]
  }
}
variable.tf
variable "type" {
  description = "Type of the query (classic_conditions, math, reduce, resample)"
  type        = string

  default = "classic_conditions" # classic_conditions, math, reduce, resample
}

variable "ref_id" {
  description = "Name of the query"
  type        = string

  default = "Z"
}

variable "interval_milliseconds" {
  description = "Number of milliseconds between each points of the timeserie. Use period instead, this attribute is only here because Grafana set it"
  type        = number

  default = null
}

variable "max_data_points" {
  description = "Maximun number of points for the timeseries"
  type        = number

  default = null
}

variable "expression" {
  description = "Must be the ref_id of the input for reduce and resample, for math is the formula"
  type        = string

  default = null
}

variable "reducer" {
  description = "The function to apply on time series to reduce it (mean, min, max, sum, count, last)"
  type        = string

  default = null # values: "mean", "min", "max", "sum", "count", "last"
}

variable "reduce_mode" {
  description = "strict: Result can be NaN if series contains non-numeric data | dropNN: Drop NaN, +/- and null from input series before reducing | replaceNN: Replace NaN, +/-Inf and null with a constant before reducing (variable 'reduce_replace_with')"
  type        = string

  default = null # values: "dropNN", "replaceNN", "strict"
}

variable "reduce_replace_with" {
  description = "When reduce_mode is 'replaceNN', use this value to replace all the NaN, +/-Inf and null values"
  type        = number

  default = null
}

variable "down_sampler" {
  description = "The reduction function to use when there are more than one data point per window sample (min, max, mean, sum)"
  type        = string

  default = null # values: min, max, mean, sum
}

variable "up_sampler" {
  description = "The method to use to fill a window sample that has no data points. pad: fills with the last know value | backfill: with next known value | fillna: to fill empty sample windows with NaNs"
  type        = string

  default = null # values: pad, backfilling, fillna
}

variable "window" {
  description = "The duration of time to resample to, for example 10s. Units may be s seconds, m for minutes, h for hours, d for days, w for weeks, and y of years"
  type        = string

  default = null
}

variable "conditions" {
  description = "List of conditions to fire the alert"
  type = list(object({
    evaluator_params    = list(number),            # 1 param for lt/gt and 2 params for outside_range/within_range, 0 for no_value
    evaluator_type      = string,                  # gt, lt, outside_range, within_range, no_value
    operator_type       = optional(string, "and"), # for multiple conditions
    query_ref_id_target = optional(string, "A"),
    reducer_type        = string, # sum, min, max, count, last, median, avg, count_non_null, diff, diff_abs, percent_diff, percent_diff_abs
  }))

  default = null
}
output.tf
output "model" {
  value = jsonencode(local.model)
}

output "ref_id" {
  value = var.ref_id
}
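
Putting the two modules together, a complete rule can then be assembled roughly like this (module paths, the datasource UID, and the threshold are placeholders; the data block shape follows the rule_group resource documentation linked earlier):

module "queue_depth_condition" {
  source = "./modules/expression" # hypothetical path to the module above

  conditions = [{
    evaluator_params    = [100]
    evaluator_type      = "gt"
    operator_type       = "and"
    query_ref_id_target = module.queue_depth_query.ref_id
    reducer_type        = "last"
  }]
}

resource "grafana_rule_group" "queue_depth" {
  name             = "SQS queue depth"
  folder_uid       = grafana_folder.rule_folder.uid
  interval_seconds = 60

  rule {
    name      = "Too many visible messages"
    condition = module.queue_depth_condition.ref_id # "Z", the expression query

    data {
      ref_id         = module.queue_depth_query.ref_id
      datasource_uid = "my-cloudwatch-uid" # placeholder for the real datasource UID
      model          = module.queue_depth_query.model

      relative_time_range {
        from = 600
        to   = 0
      }
    }

    data {
      ref_id         = module.queue_depth_condition.ref_id
      datasource_uid = "-100" # the built-in expression pseudo-datasource
      model          = module.queue_depth_condition.model

      relative_time_range {
        from = 0
        to   = 0
      }
    }
  }
}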

eraac avatar Oct 26 '22 11:10 eraac

Thanks @Eraac, it works great. One more question that is maybe out of the scope of this issue.

Is there a way to create the "rule" block in the "grafana_rule_group" resource automatically using a loop like for_each? I am aware the for_each meta-argument applies to resources; are you aware of something similar that can be used for a nested block?

example :

resource "grafana_rule_group" "my_alert_rule" {
    name = "My Rule Group"
    folder_uid = grafana_folder.rule_folder.uid
    interval_seconds = 240
    org_id = 1
    
    for_each = toset( ["rule1", "rule2", "rule3", "rule4"] )
    rule {
        name = each.key
     
```}
}

obounaim avatar Oct 27 '22 13:10 obounaim

@obounaim https://developer.hashicorp.com/terraform/language/expressions/dynamic-blocks
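
For reference, a minimal sketch of what that looks like with a dynamic block (the rule body is stripped down to the name, as in the example above):

resource "grafana_rule_group" "my_alert_rule" {
  name             = "My Rule Group"
  folder_uid       = grafana_folder.rule_folder.uid
  interval_seconds = 240
  org_id           = 1

  # One rule block is generated per element of the set.
  dynamic "rule" {
    for_each = toset(["rule1", "rule2", "rule3", "rule4"])

    content {
      name = rule.value
      # condition, data, etc. go here exactly as in a static rule block
    }
  }
}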

eraac avatar Oct 27 '22 14:10 eraac

I understand the trade-offs mentioned by @alexweav. It is really hard to maintain Terraform-native definitions for every single supported data source, which, in the meantime, may change at their own evolution pace.

In practice though, you do not need tons of supported data sources; you use just a few. It seems totally possible to create a parameters.tf file with a summary of what makes every alert rule unique, and template it into the proper format (including JSON) at the final stage. The structure can be as @sammit20 suggested, or simpler/more complex depending on your needs. You write this thing once and then reuse it for every alert rule. It is impossible, though, to create an ideal solution for everyone, and every user must do it on their own based on what makes sense for them.
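
As a rough illustration of that idea (the rule names, fields, and expressions below are invented for the example, not a recommendation), the per-rule summary can be a plain map in parameters.tf, and a single for expression or dynamic block expands it into the verbose rule representation at the end:

parameters.tf
locals {
  # Only what makes each alert rule unique; everything else is templated later.
  alert_rules = {
    high_cpu = {
      summary  = "CPU above 80% for 5 minutes"
      expr     = "avg(rate(node_cpu_seconds_total{mode!=\"idle\"}[5m])) > 0.8"
      severity = "warning"
    }
    low_disk = {
      summary  = "Less than 10% of disk space left"
      expr     = "node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1"
      severity = "critical"
    }
  }
}

# A shared module or dynamic "rule" block then expands local.alert_rules into
# full grafana_rule_group rules, jsonencode()-ing the query model on the final stage.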

greatvovan avatar Nov 04 '22 20:11 greatvovan