PrometheusAlert

Prometheus alerting rule examples: contributions welcome

Open Zhang21 opened this issue 5 years ago • 8 comments

This site has many sample Prometheus alerting rules: https://awesome-prometheus-alerts.grep.to/

# Available memory on CentOS 6 and 7
node_memory_MemAvailable_bytes or (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes + node_memory_Slab_bytes)

An example Prometheus rules file. The level label selects the notification method; the level and kind labels are used for alert inhibition.
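
As a sketch of how the fallback expression above might be used, the rules below record available memory once and alert on the resulting ratio. The rule names, the 90% threshold, the `for` duration, and the use of the `instance` label are illustrative assumptions, not part of the original post.

```yaml
groups:
- name: node-memory
  rules:
  # Available memory: MemAvailable where node_exporter exposes it,
  # otherwise the Buffers + Cached + MemFree + Slab fallback.
  - record: instance:node_memory_available:bytes
    expr: |
      node_memory_MemAvailable_bytes
      or (
        node_memory_Buffers_bytes + node_memory_Cached_bytes
        + node_memory_MemFree_bytes + node_memory_Slab_bytes
      )
  # Fraction of memory in use, between 0 and 1.
  - record: instance:node_memory_utilization:ratio
    expr: 1 - instance:node_memory_available:bytes / node_memory_MemTotal_bytes
  # Hypothetical alert; tune the threshold and duration for your hosts.
  - alert: MemoryUsageAbove90Percent
    expr: instance:node_memory_utilization:ratio * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Memory usage above 90%"
      description: "Memory usage on {{ $labels.instance }} is {{ $value | humanize }}"
```

Recording the expression once keeps the `or` fallback in a single place, so later rules can refer to it by name.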


groups:
- name: node-cpu
  rules:
  # Number of CPU cores
  - record: instance:node_cpus:count
    expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
  # Per-CPU utilization
  - record: instance_cpu:node_cpu_seconds_not_idle:rate1m
    expr: sum without (mode) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
  # Overall CPU utilization
  - record: instance:node_cpu_utilization:ratio
    expr: avg without (cpu) (instance_cpu:node_cpu_seconds_not_idle:rate1m)

  - alert: CPU usage above 88%
    expr: instance:node_cpu_utilization:ratio * 100 > 88
    for: 5m
    labels:
      severity: critical
      level: 3
    annotations:
      summary: "CPU usage above 88%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}"
  - alert: CPU usage above 93%
    expr: instance:node_cpu_utilization:ratio * 100 > 93
    for: 2m
    labels:
      severity: emergency
      level: 4
    annotations:
      summary: "CPU usage above 93%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}"
      wxurl: "webhook1, webhook2"
      mobile: "13xxx, 15xxx"

  - alert: CPU load above core count
    expr: node_load5 > instance:node_cpus:count
    for: 5m
    labels:
      severity: warning
      level: 2
    annotations:
      summary: "CPU load above core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
  - alert: CPU load above 2x core count
    expr: node_load1 > (instance:node_cpus:count * 2)
    for: 4m
    labels:
      severity: critical
      level: 3
    annotations:
      summary: "CPU load above 2x core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
      alertgroup: ops
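
The post says the level and kind labels drive inhibition but does not show the Alertmanager side. A minimal sketch of what that could look like, assuming each alert carries a kind label (for example CpuUsage or CpuLoad) and an instance label, and assuming Alertmanager 0.22 or later for the `source_matchers`/`target_matchers` syntax:

```yaml
# alertmanager.yml (fragment)
inhibit_rules:
  # While a level-4 alert fires, mute level-3 alerts
  # of the same kind on the same instance.
  - source_matchers:
      - 'level="4"'
    target_matchers:
      - 'level="3"'
    equal: ['kind', 'instance']
  # Likewise, a level-3 alert mutes the matching level-2 alert.
  - source_matchers:
      - 'level="3"'
    target_matchers:
      - 'level="2"'
    equal: ['kind', 'instance']
```

With rules like these, only the most severe alert of a given kind per host would be delivered while it is firing.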

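To fire or suppress alerts at specific times, see: https://www.robustperception.io/combining-alert-conditions

groups:
- name: specific-time-range
  rules:
  - alert: Do not fire between 00:00 and 06:00
    # Note: Prometheus evaluates hour() in UTC. 00:00-06:00 in UTC+8 is 16:00-22:00 UTC.
    expr: <PromQL expression> and on() (hour() < 16 or hour() >= 22)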
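
The same pattern can also restrict an alert to a window instead of excluding one. A sketch, assuming a UTC+8 deployment where 09:00-18:00 local time corresponds to 01:00-10:00 UTC; the alert name and the load threshold are placeholders:

```yaml
groups:
- name: time-window-examples
  rules:
  # Fires only Monday to Friday, 01:00-10:00 UTC (09:00-18:00 in UTC+8).
  # hour() and day_of_week() are evaluated in UTC; day_of_week() returns 0 for Sunday.
  - alert: HighLoadDuringWorkingHours
    expr: >
      node_load5 > 4
      and on() (hour() >= 1 and hour() < 10)
      and on() (day_of_week() >= 1 and day_of_week() <= 5)
    for: 5m
```

The `on()` modifier is needed because `hour()` and `day_of_week()` return a sample with no labels, which would otherwise never match the labelled series on the left.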

Zhang21, Dec 17 '20 03:12

Hello! I would like to change wxurl to emailurl. Can I write it as emailurl: url?

zhangcaiyuan, Mar 31 '21 05:03

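Hello, with custom templates, how can a single alert be sent to several channels at once (for example webhook, email, and SMS together)? In my tests only one channel receives the alert. My Alertmanager test configuration is: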

global:
  resolve_timeout: 5m
route:
  group_by: ['gateway']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'webhook'
  routes:
  - receiver: 'prometheusalert-email'
  - receiver: 'prometheusalert-dd'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://172.0.0.1:8891/alertmanager/addAlert/'
- name: 'prometheusalert-email'
  webhook_configs:
  - url: 'http://172.0.0.1:8080/prometheusalert?type=email&tpl=prometheus-email&email=t***@126.com'
- name: 'prometheusalert-dd'
  webhook_configs:
  - url: 'http://172.0.0.1:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=&at='

tommy19810810, Nov 13 '21 08:11

Add continue: true under each receiver entry in routes.
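
A sketch of how that might look for the configuration in the question. By default Alertmanager stops at the first child route that matches, and `continue: true` lets the alert go on to the next sibling. The root receiver is only used when no child route matches, so this sketch also lists webhook as a child route; that part is an assumption about the intent (all three channels should receive the alert), not something stated in the thread.

```yaml
route:
  group_by: ['gateway']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'webhook'                  # fallback when no child route matches
  routes:
    - receiver: 'webhook'
      continue: true                   # keep evaluating the next sibling route
    - receiver: 'prometheusalert-email'
      continue: true
    - receiver: 'prometheusalert-dd'   # last route, so continue is not needed
```

The receivers section stays as in the question.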

michael-liumh, Feb 17 '22 07:02

Hello, are there any alerting rules for Oracle?

qianglatiao, Jul 20 '22 09:07

No, I have not used the Oracle plugin.

zhangcaiyuan, Jul 20 '22 10:07

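A question about the original post, quoted here:

This site has many sample Prometheus alerting rules: https://awesome-prometheus-alerts.grep.to/

# Available memory on CentOS 6 and 7
node_memory_MemAvailable_bytes or (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes + node_memory_Slab_bytes)

An example Prometheus rules file. The level label selects the notification method; the level and kind labels are used for alert inhibition.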

groups:
- name: node-cpu
  rules:
  # Number of CPU cores
  - record: instance:node_cpus:count
    expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
  # Per-CPU utilization
  - record: instance_cpu:node_cpu_seconds_not_idle:rate1m
    expr: sum without (mode) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
  # Overall CPU utilization
  - record: instance:node_cpu_utilization:ratio
    expr: avg without (cpu) (instance_cpu:node_cpu_seconds_not_idle:rate1m)

  - alert: CPU usage above 88%
    expr: instance:node_cpu_utilization:ratio * 100 > 88
    for: 5m
    labels:
      severity: critical
      level: 3
      kind: CpuUsage
    annotations:
      summary: "CPU usage above 88%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}"

  - alert: CPU usage above 93%
    expr: instance:node_cpu_utilization:ratio * 100 > 93
    for: 2m
    labels:
      severity: emergency
      level: 4
      kind: CpuUsage
    annotations:
      summary: "CPU usage above 93%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}"
      wxurl: "webhook1, webhook2"
      mobile: "13xxx, 15xxx"

  - alert: CPU load above core count
    expr: node_load5 > instance:node_cpus:count
    for: 5m
    labels:
      severity: warning
      level: 2
      kind: CpuLoad
    annotations:
      summary: "CPU load above core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"

  - alert: CPU load above 2x core count
    expr: node_load5 > (instance:node_cpus:count * 2)
    for: 2m
    labels:
      severity: critical
      level: 3
      kind: CpuLoad
    annotations:
      summary: "CPU load above 2x core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
      wxurl: "webhook1, webhook2"

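To fire or suppress alerts at specific times, see: https://www.robustperception.io/combining-alert-conditions

groups:
- name: specific-time-range
  rules:
  - alert: Do not fire between 00:00 and 06:00
    # Note: Prometheus evaluates hour() in UTC. 00:00-06:00 in UTC+8 is 16:00-22:00 UTC.
    expr: <PromQL expression> and on() (hour() < 16 or hour() >= 22)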

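How is this used? I could not find it in the documentation. Where is the address that webhook1 stands for configured: in app.conf? Or do I write the full addresses directly in this field?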

wxurl: "webhook1, webhook2"
mobile: "13xxx, 15xxx"

running-db, Nov 05 '23 09:11

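@running-db Write the multiple addresses directly in that field. The documentation covers this. A later release also added alert groups: you can define alert groups in app.conf and then reference one or more of them in your rules.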

Zhang21, Nov 13 '23 02:11