awesome-prometheus-alerts

Refacto: write more accurate descriptions for faster troubleshooting

Open samber opened this issue 4 years ago • 3 comments

Example:

From:

- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'

To:

- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'

A dedicated effect field would let us improve the alert templates.
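
As a rough sketch of what that could look like, assuming a hypothetical `effect` key next to the existing `name`, `description` and `query` keys (the key name and its wording are illustrative, not an agreed format):

```yaml
- name: Prometheus rule evaluation failures
  # Cause: what happened.
  description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
  # Hypothetical field: what the failure means for the operator.
  effect: 'Some alerts may be silently ignored.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'
```

Keeping the effect separate would let a template choose whether to append it to the description or show it on its own.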

samber avatar Mar 08 '20 13:03 samber

I would welcome an effect field. I've solved this locally by adding one, and it's very helpful: it keeps the description short where only that is needed (on a status board, for example), while Slack messages can still include specific resolutions.

e.g. Screenshot 2020-04-30 at 14 22 15

robert-will-brown avatar Apr 30 '20 13:04 robert-will-brown

We can probably find a balance between:

  • Description/cause
  • Effects
  • Resolution guidelines

The GitLab infrastructure team adds a reference to a troubleshooting Markdown document in its alerts.

See:

  • https://gitlab.com/gitlab-com/runbooks/blob/master/rules/prometheus-metamons.yml
  • https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/monitoring/node_memory_alerts.md

samber avatar May 03 '20 19:05 samber

Effects

This is what the alert name should convey, as it is the first thing an operator sees when receiving an alert. It can additionally be enhanced with a summary annotation field.

Description/cause

In the Prometheus community this is usually done with either a message field (for example in the kubernetes-monitoring/kubernetes-mixin project) or a description field (for example in the node-mixin project).

Resolution guidelines

This is basically a runbook/SOP. For example, the kubernetes-mixin project includes these as a runbook_url field in the alert annotations.

There, the runbooks are located in one file, and each link points to a specific anchor.

This field is usually the most problematic one, as creating a runbook requires deep knowledge of the system itself.
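
To make the mapping concrete, here is a minimal sketch of a Prometheus alerting rule that carries all three pieces as annotations. The group name, alert name, severity, summary wording and runbook URL are assumptions for illustration only; they are not taken from kubernetes-mixin or node-mixin.

```yaml
groups:
  - name: prometheus-self-monitoring   # hypothetical group name
    rules:
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
        labels:
          severity: critical   # assumed severity
        annotations:
          # Effect: the first thing the operator reads.
          summary: 'Prometheus rule evaluation failures: alerts may be ignored'
          # Description/cause.
          description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
          # Resolution guidelines: hypothetical URL pointing to an anchor in a single runbook file.
          runbook_url: 'https://example.com/runbooks.md#prometheus-rule-evaluation-failures'
```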


Essentially, these are problems the Prometheus community has already solved.

paulfantom avatar May 04 '20 09:05 paulfantom