Inhibition rule not working
We have some inhibition rules that work as expected, but when we add a new inhibition rule without the "equal" field, it does not work. We also tested the new rule with an "equal" field, like the other rules, but it still did not work.
Working Rules:
inhibit_rules:
  - source_match:
      alertname: "BlackboxProbeFailed"
    target_match_re:
      severity: "very high|high|warning"
    equal: ["hostname"]
  - source_match:
      alertname: "Network-Down"
    target_match_re:
      alertname: "BlackboxProbeFailed|Host-DOWN|prometheus-heartbeat"
    equal: ["category"]
The new rule is added after the rules above, without the "equal" field:
  - source_match:
      alertname: "Test-service-cron"
    target_match_re:
      alertname: "Test-service-sshd"
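For reference, the matching semantics of an inhibit rule can be sketched in Python. This is a simplified illustration, not Alertmanager's actual code; the rule and label sets below mirror the configuration above, and the key point is that an omitted `equal` field imposes no extra constraint:

```python
import re

# Simplified sketch of Alertmanager inhibition matching (illustrative only).
# target_match_re regexes are treated as fully anchored, as Alertmanager does.
def inhibits(rule, source_labels, target_labels):
    # source_match: exact-match conditions the source (inhibiting) alert must satisfy
    for k, v in rule.get("source_match", {}).items():
        if source_labels.get(k) != v:
            return False
    # target_match_re: anchored regex conditions the target (inhibited) alert must satisfy
    for k, v in rule.get("target_match_re", {}).items():
        if not re.fullmatch(v, target_labels.get(k, "")):
            return False
    # equal: these labels must carry identical values on both alerts.
    # When the field is omitted, the list is empty and all() is trivially True,
    # so no label-equality constraint applies.
    return all(source_labels.get(k) == target_labels.get(k) for k in rule.get("equal", []))

rule = {
    "source_match": {"alertname": "Test-service-cron"},
    "target_match_re": {"alertname": "Test-service-sshd"},
}
print(inhibits(rule,
               {"alertname": "Test-service-cron"},
               {"alertname": "Test-service-sshd"}))  # True: no equal field needed
```

So on paper the new rule should inhibit Test-service-sshd whenever Test-service-cron is firing, with no shared-label requirement.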
Below are the test alerting rules created for this:
- alert: Test-service-cron
  expr: node_systemd_unit_state{name="cron.service",exported_state="active"} == 0
  for: 5m
  labels:
    severity: very high
    category: Exceptions
  annotations:
    description: "Service has been down for over 5 minutes in - {{$labels.hostname}}"
    summary: "RED - {{$labels.hostname}} - CRON Service down"
- alert: Test-service-sshd
  expr: node_systemd_unit_state{name="sshd.service",exported_state="active"} == 0
  for: 5m
  labels:
    severity: very high
    spc: disabled
    category: Exceptions
  annotations:
    description: "Service sshd has been down for over 5 minutes in - {{$labels.hostname}}"
    summary: "RED - {{$labels.hostname}} - SSHD Service down"
To test the new rule, we first stopped the cron service. Once the "Test-service-cron" alert fired, we stopped the sshd service. However, the "Test-service-sshd" alert also fired, so the inhibition rule did not suppress the target alert as expected. We verified the alert firing status through the "ALERTS" metric.
Questions:
- Are there any specific requirements or conditions for inhibition rules to work without the "equal" field?
- Could there be any conflicts or precedence issues with the existing inhibition rules that might affect the new rule?
- Could there be any version-specific issues or bugs related to inhibition rules that we should be aware of?
I'm afraid I haven't been able to reproduce this; it works for me. Here is the configuration file:
receivers:
  - name: test
route:
  receiver: test
inhibit_rules:
  - source_match:
      alertname: "BlackboxProbeFailed"
    target_match_re:
      severity: "very high|high|warning"
    equal: ["hostname"]
  - source_match:
      alertname: "Network-Down"
    target_match_re:
      alertname: "BlackboxProbeFailed|Host-DOWN|prometheus-heartbeat"
    equal: ["category"]
  - source_match:
      alertname: "Test-service-cron"
    target_match_re:
      alertname: "Test-service-sshd"
I added the two alerts:
./amtool --alertmanager.url=http://127.0.0.1:9093 alert add alertname=Test-service-cron
./amtool --alertmanager.url=http://127.0.0.1:9093 alert add alertname=Test-service-sshd
The debug logs show Test-service-sshd being inhibited:
time=2025-01-16T10:49:47.624Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=Test-service-cron[8bc38d5][active]
time=2025-01-16T10:49:51.561Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=Test-service-sshd[643222c][active]
time=2025-01-16T10:50:17.626Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup={}:{} alerts="[Test-service-cron[8bc38d5][active] Test-service-sshd[643222c][active]]"
time=2025-01-16T10:50:17.626Z level=DEBUG source=notify.go:579 msg="Notifications will not be sent for muted alerts" component=dispatcher alerts=[Test-service-sshd[643222c][active]] reason=inhibition
And so does the API:
[
  {
    "annotations": {},
    "endsAt": "2025-01-16T10:54:51.561Z",
    "fingerprint": "643222c68932c063",
    "receivers": [
      {
        "name": "test"
      }
    ],
    "startsAt": "2025-01-16T10:49:51.561Z",
    "status": {
      "inhibitedBy": [
        "8bc38d5516aaa89d"
      ],
      "mutedBy": [],
      "silencedBy": [],
      "state": "suppressed"
    },
    "updatedAt": "2025-01-16T10:49:51.561Z",
    "labels": {
      "alertname": "Test-service-sshd"
    }
  },
  {
    "annotations": {},
    "endsAt": "2025-01-16T10:54:47.624Z",
    "fingerprint": "8bc38d5516aaa89d",
    "receivers": [
      {
        "name": "test"
      }
    ],
    "startsAt": "2025-01-16T10:49:47.624Z",
    "status": {
      "inhibitedBy": [],
      "mutedBy": [],
      "silencedBy": [],
      "state": "active"
    },
    "updatedAt": "2025-01-16T10:49:47.624Z",
    "labels": {
      "alertname": "Test-service-cron"
    }
  }
]
Could you do the equivalent test and share the debug logs from your Alertmanager, so we can compare?
@ricksj5 can I close this, or is the issue still occurring?
still occurring
still occurring
Do you have the same configuration as @ricksj5 or a different one? If it's different, could you share it along with debug logs for dispatch.go and notify.go?
We have the same configuration as @ricksj5. Could you let us know how to enable debug logs in the VictoriaMetrics Alertmanager values.yaml? In our case both alerts fire, and there is nothing in the logs of the vmalert-server and vmalert-alertmanager pods.
You'll need to ask the VictoriaMetrics folks that, we are not familiar with vmalert-server or vmalert-alertmanager.
Hello @pragzsing @grobinson-grafana, a VictoriaMetrics contributor here. I'd like to keep the discussion in one place, so I will respond in this issue instead of opening a separate issue in our repo. Please let me know if you would prefer to move this to an issue in our repo.
vmalert-alertmanager is just an Alertmanager deployment managed by our operator's CRD. In order to enable debug logs you'll need to add this:
spec:
  logLevel: "debug"
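For context, a minimal sketch of where this field lives in the operator's VMAlertmanager resource; the metadata name is an assumption, adjust to your deployment:

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlertmanager
metadata:
  name: vmalert-alertmanager   # assumed resource name
spec:
  logLevel: "debug"            # enables debug-level logging in the managed Alertmanager
```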
vmalert-server is the vmalert component responsible for alert evaluation. It evaluates alerting rules against VictoriaMetrics and sends notifications to Alertmanager. You can find information about debugging alerts evaluated by vmalert here.
We have enabled the debug logs and attempted to inhibit the alert. While we can see that notifications are muted for the target alert (Test-service-sshd), the logs do not show "reason=inhibition" as seen in the logs you shared. Also, we observed that for the target alert (Test-service-sshd), a ticket was not created in our ticketing tool as per expectation, where we send notifications to create the ticket. But I'm not sure why both alerts are showing in the firing state on Grafana, which I believe should not be the case for inhibited/target alert as shown in the snapshot attached.
The debug logs show Test-service-sshd being muted, but without the reason=inhibition field:
ts=2025-04-02T14:21:03.079Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert=Test-service-sshd[b034de1][active]
ts=2025-04-02T14:21:03.081Z caller=dispatch.go:164 level=debug component=dispatcher msg="Received alert" alert=Test-service-cron[884f08d][active]
ts=2025-04-02T14:21:03.083Z caller=dispatch.go:516 level=debug component=dispatcher aggrGroup="{}/{spc_vlab=\"enabled\"}:{alertgroup=\"services\", alertname=\"Test-service-sshd\", exported_state=\"active\", exported_type=\"notify\", name=\"sshd.service\", severity=\"very high\", state=\"Live\"}" msg=flushing alerts=[Test-service-sshd[b034de1][active]]
ts=2025-04-02T14:21:03.083Z caller=notify.go:551 level=debug component=dispatcher msg="Notifications will not be sent for muted alerts" alerts=[Test-service-sshd[b034de1][active]]
We have enabled the debug logs and attempted to inhibit the alert. While we can see that notifications are muted for the target alert (Test-service-sshd), the logs do not show "reason=inhibition" as seen in the logs you shared.
It sounds to me like you might be using an older version of Alertmanager? The reason=inhibition message was added in Alertmanager 0.28.0.
Also, we observed that for the target alert (Test-service-sshd), a ticket was not created in our ticketing tool as per expectation, where we send notifications to create the ticket.
Good, sounds like it is working as expected?
But I'm not sure why both alerts are showing in the firing state on Grafana, which I believe should not be the case for inhibited/target alert as shown in the snapshot attached.
What is the metric being queried here?
We are using this metric - ALERTS{alertname=~"Test-service.*"}
That metric comes from Prometheus, not the Alertmanager, so it won't include information about silences or inhibitions. https://github.com/prometheus/prometheus/blob/8ad21d0659862ff320641aa28bd5928b2c603d05/rules/alerting.go#L41
You'll want to use the metric alertmanager_alerts, which includes information about suppressed (silenced and inhibited) alerts.
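For example, assuming Prometheus scrapes your Alertmanager, a query along these lines should show only suppressed alerts, since the gauge is partitioned by a state label with the values active and suppressed:

```promql
alertmanager_alerts{state="suppressed"}
```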
We did not find this metric - alertmanager_alerts
If you haven't set up Prometheus to scrape metrics from your Alertmanager server, then you won't be able to query them.
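A minimal sketch of such a scrape job for the Prometheus configuration; the target address is an assumption, adjust to wherever your Alertmanager is reachable:

```yaml
scrape_configs:
  - job_name: "alertmanager"
    static_configs:
      - targets: ["127.0.0.1:9093"]   # assumed Alertmanager host:port
```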