ha-sap-terraform-deployments

Monitoring rules for systemd units (monitoring_srv/prometheus/rules.yml)

Open • pirat013 opened this issue 3 years ago • 2 comments

The current configuration monitors the active state of the systemd units like this:

  - name: systemd-services-monitoring
    rules:
      - alert: service-down-pacemaker
        expr: node_systemd_unit_state{name="pacemaker.service", state="active"} == 0
        labels:
          severity: page
        annotations:
          summary: Pacemaker service not running

This leads to false-positive alerts whenever an admin stops the systemd units for maintenance work or other tasks. I would suggest changing the monitoring rule from the active state to the failed state:

  - name: systemd-services-monitoring
    rules:
      - alert: service-failed-pacemaker
        expr: node_systemd_unit_state{name="pacemaker.service", state="failed"} == 1
        labels:
          severity: page
        annotations:
          summary: Pacemaker service could not start or has crashed.
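To make the difference between the two expressions concrete, here is a minimal Python sketch. It assumes that `node_systemd_unit_state` exposes one series per state, with value 1 for the unit's current state and 0 for all others; the helper names are made up for illustration and are not part of the repository.

```python
# Sketch: how the two alert expressions behave for a stopped vs. a crashed unit.
# Assumption: one series per state, value 1 for the current state, 0 otherwise.

STATES = ("active", "activating", "deactivating", "failed", "inactive")


def unit_state_series(current_state):
    """Return {state: value} as the exporter would report it for one unit."""
    return {state: 1 if state == current_state else 0 for state in STATES}


def alert_active_is_zero(series):
    """Original rule: node_systemd_unit_state{state="active"} == 0."""
    return series["active"] == 0


def alert_failed_is_one(series):
    """Proposed rule: node_systemd_unit_state{state="failed"} == 1."""
    return series["failed"] == 1


for scenario, state in [
    ("running normally", "active"),
    ("stopped by an admin", "inactive"),
    ("crashed", "failed"),
]:
    series = unit_state_series(state)
    print(f"{scenario}: active==0 -> {alert_active_is_zero(series)}, "
          f"failed==1 -> {alert_failed_is_one(series)}")
```

In this model only the original rule matches a unit that an admin stopped on purpose, while both rules match a crashed unit.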

This would result in fewer calls in situations where a systemd unit is stopped for maintenance. If we go this way, we could also think about shortening the list by using a configuration like this:

  - alert: HostSystemdServiceCrashed
    expr: node_systemd_unit_state{state="failed"} == 1
    for: 1m
    labels:
      severity: page
    annotations:
      description: |-
        systemd service crashed
        VALUE = {{ $value }}
        LABELS = {{ $labels }}
      summary: Host systemd service crashed (instance {{ $labels.instance }})
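The `for: 1m` clause means the alert only fires once the expression has been true continuously for a full minute; before that it is merely pending. A rough Python sketch of that behaviour follows. The 15-second evaluation interval and the function name are assumptions chosen for illustration.

```python
# Sketch of the `for: 1m` clause: the alert fires only after the expression has
# been true continuously for the whole duration. Timestamps are in seconds; the
# 15-second evaluation interval is an assumption, not a value from the repo.

def firing_times(samples, for_seconds=60):
    """samples: list of (timestamp, failed_value). Returns timestamps at which the alert fires."""
    fired = []
    pending_since = None
    for timestamp, value in samples:
        if value == 1:
            if pending_since is None:
                pending_since = timestamp  # alert enters "pending"
            if timestamp - pending_since >= for_seconds:
                fired.append(timestamp)  # pending long enough: "firing"
        else:
            pending_since = None  # condition cleared, reset the timer
    return fired


# A unit that is failed only briefly and then recovers never fires.
short_blip = [(0, 0), (15, 1), (30, 1), (45, 0), (60, 0)]
# A unit that stays failed fires once 60 seconds have passed since it first failed.
stays_failed = [(0, 0), (15, 1), (30, 1), (45, 1), (60, 1), (75, 1), (90, 1)]

print(firing_times(short_blip))    # []
print(firing_times(stays_failed))  # [75, 90]
```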

pirat013 • Nov 16 '21 14:11

@pirat013 I don't see anything that speaks against this change. Are you willing to submit a PR?

yeoldegrove • Feb 09 '22 13:02

@yeoldegrove sorry, I didn't see your request. We may also have to consider a combination of "service is enabled" and "service is not started", since this would reflect the original idea better than my suggestion. I'll try to work out this rule and can then create a PR, but I can't say when that will happen.
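As a rough illustration of that combined condition, here is a hypothetical Python sketch. The `enabled` flag is an assumed input: nothing in this thread says which metric would provide it, and the function name is invented.

```python
# Hypothetical sketch: alert when a unit is enabled but not running.
# `enabled` is an assumed input; the thread does not name a metric for it.

def enabled_but_not_running(enabled, current_state):
    """True when the unit is enabled and its state is anything other than "active"."""
    return enabled and current_state != "active"


print(enabled_but_not_running(True, "failed"))     # True
print(enabled_but_not_running(True, "inactive"))   # True
print(enabled_but_not_running(False, "inactive"))  # False
print(enabled_but_not_running(True, "active"))     # False
```

Under this reading, a unit that is stopped while still enabled would match, whereas a disabled unit would not.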

pirat013 avatar Apr 04 '22 10:04 pirat013