ha-sap-terraform-deployments
Monitoring rules for systemd units (monitoring_srv/prometheus/rules.yml)
The current rules for the systemd units monitor the active state like this:
- name: systemd-services-monitoring
  rules:
  - alert: service-down-pacemaker
    expr: node_systemd_unit_state{name="pacemaker.service", state="active"} == 0
    labels:
      severity: page
    annotations:
      summary: Pacemaker service not running
This leads to false positives during maintenance work or other tasks in which an admin stops the systemd units on purpose. I suggest changing the monitoring rule from active to failed:
- name: systemd-services-monitoring
  rules:
  - alert: service-failed-pacemaker
    expr: node_systemd_unit_state{name="pacemaker.service", state="failed"} == 1
    labels:
      severity: page
    annotations:
      summary: Pacemaker service could not start or is crashed.
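To check that the failed-state rule really stays quiet during a planned stop, a promtool unit test could look roughly like the sketch below. It assumes the rule above is part of a loadable rules.yml (with the usual top-level groups: key) in the same directory; the test file name, the instance label, and the sample values are made up for illustration.

```yaml
# rules_test.yml (hypothetical file name); run with: promtool test rules rules_test.yml
rule_files:
  - rules.yml          # assumed to contain the service-failed-pacemaker rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # An admin stops pacemaker after two minutes: "active" drops to 0 ...
      - series: 'node_systemd_unit_state{name="pacemaker.service", state="active", instance="node1"}'
        values: '1 1 0 0 0'
      # ... but the unit never enters the "failed" state.
      - series: 'node_systemd_unit_state{name="pacemaker.service", state="failed", instance="node1"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      - eval_time: 4m
        alertname: service-failed-pacemaker
        exp_alerts: []   # an empty list means: no alert is expected to fire
```

With the original active == 0 expression, the same input series would produce an alert at 4m, which is exactly the false positive described above.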
This would result in fewer calls when a systemd unit is stopped for maintenance. If we go this way, we could also think about shortening the list by using a generic rule like this:
- alert: HostSystemdServiceCrashed
  expr: node_systemd_unit_state{state="failed"} == 1
  for: 1m
  labels:
    severity: page
  annotations:
    description: |-
      systemd service crashed
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
    summary: Host systemd service crashed (instance {{ $labels.instance }})
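If paging on every failed unit on a host turns out to be too noisy, the generic rule could be narrowed with a regex matcher on the name label. This is only a sketch: the alert name is invented here, and apart from pacemaker.service the unit names in the regex are placeholders that would need to be replaced with the units actually monitored.

```yaml
- alert: ClusterSystemdServiceFailed    # hypothetical alert name
  # =~ is a fully anchored regex match in PromQL; "\\." matches a literal dot.
  # corosync is only an illustrative second unit, not taken from the current rules.
  expr: node_systemd_unit_state{state="failed", name=~"(pacemaker|corosync)\\.service"} == 1
  for: 1m
  labels:
    severity: page
  annotations:
    summary: systemd unit {{ $labels.name }} failed (instance {{ $labels.instance }})
```

This keeps a single rule instead of one per service, while the name label in the summary still tells the on-call person which unit failed.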
@pirat013 I don't see anything that speaks against this change. Are you willing to submit a PR?
@yeoldegrove sorry, I didn't see your request. We may also have to consider the combination "service is enabled and not started", as this would reflect the original idea better than my suggestion. I'll try to figure out this rule and can create a PR, but I can't say when that will happen.