(Stale) Reminder doesn't get deleted
During the past weeks, we observed a stale reminder that did not get deleted even though the service was already back to passing/healthy.
That means critical notifications keep being sent even after a service has recovered and is healthy again:
```
consul-alerts@xxx-europe-xxx
System is CRITICAL The following nodes are currently experiencing issues: Failed: 1 Warning: 0 Passed: 0
Node: xxxx Servicename: Servicename Since: 2020-01-26 07:21:06.216467314 +0100 CET Output:
HTTP GET http://localhost:xxxx/health: 200 OK Output: healthy
```
The KV entry under `consul-alerts/reminders/$hostname$` still exists when this happens. Once the entry is deleted manually, the erroneously sent notifications stop.
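For reference, the manual workaround can be sketched as follows. The node/service names are taken from the example further down in this report; the `command -v` guard is only there so the sketch degrades gracefully on a machine without the consul CLI:

```shell
# Sketch of the manual workaround: inspect the reminder keys, then delete
# the stale one so the repeated critical notifications stop.
if command -v consul >/dev/null 2>&1; then
  # List all reminder entries currently stored in the KV store.
  consul kv get -recurse consul-alerts/reminders/
  # Delete the stale reminder for the affected node/service
  # (names from the example in this report).
  consul kv delete consul-alerts/reminders/my-node-4/myfs
  STATUS="deleted"
else
  STATUS="consul CLI not available"
fi
echo "$STATUS"
```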
I can confirm this is still the case.
My outputs also carry a timestamp, from which I can see that they are more than 20 days old; consul-alerts has not fetched the new state from the Consul service since then.
I also used `consul kv get -recurse consul-alerts/reminders` and `consul kv get -recurse consul-alerts/` to check it.
I found that restarting consul-alerts on one of the nodes clears all the old reminders, including those that were saved for completely different nodes.
For example, `consul kv get -recurse consul-alerts/reminders` returns:

```
consul-alerts/reminders/my-node-4/myfs:{"Node":"my-node-4","ServiceId":"myfs","Service":"myfs","CheckId":"myfs","Check":"Service 'myfs' check","Status":"warning","Output":"HEALTH_WARN 1/3 mons down, quorum my-node-4,my-node-5; 1 datacenter (12 osds) down; 12 osds down; 1 host (12 osds) down; Degraded data redundancy: 38197594/340437540 objects degraded (11.220%), 348 pgs degraded, 348 pgs undersized","Notes":"","Interval":120,"RmdCheck":"2025-07-06T23:32:09.301785831Z","NotifList":{"log":true,"slack":true},"VarOverrides":{"email":null,"log":null,"influxdb":null,"slack":null,"mattermost":null,"mattermost-webhook":null,"pagerduty":null,"hipchat":null,"opsgenie":null,"awssns":null,"victorops":null,"http-endpoint":null,"ilert":null,"custom":null},"Timestamp":"2025-05-17T21:30:23.887209684Z"}
consul-alerts/reminders/my-node-5/myfs:{"Node":"my-node-5","ServiceId":"myfs","Service":"myfs","CheckId":"myfs","Check":"Service 'myfs' check","Status":"warning","Output":"HEALTH_WARN 1/3 mons down, quorum my-node-4,my-node-5; 1 datacenter (12 osds) down; 12 osds down; 1 host (12 osds) down; Degraded data redundancy: 38197594/340437540 objects degraded (11.220%), 348 pgs degraded, 348 pgs undersized","Notes":"","Interval":120,"RmdCheck":"2025-07-06T23:32:09.303745681Z","NotifList":{"log":true,"slack":true},"VarOverrides":{"email":null,"log":null,"influxdb":null,"slack":null,"mattermost":null,"mattermost-webhook":null,"pagerduty":null,"hipchat":null,"opsgenie":null,"awssns":null,"victorops":null,"http-endpoint":null,"ilert":null,"custom":null},"Timestamp":"2025-05-17T21:30:23.889748437Z"}
```
`systemctl restart consul-alerts.service` on just my-node-4 also cleared the key `consul-alerts/reminders/my-node-5/myfs` after a few seconds' delay.
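The steps of that observation can be sketched as below. This is only a reproduction sketch, not a fix; it assumes the systemd unit name `consul-alerts.service` from this report, and the guard is there so the sketch is self-contained on machines without consul/systemctl:

```shell
# Reproduce the observation: restarting consul-alerts on ONE node clears
# reminder keys for OTHER nodes as well.
if command -v consul >/dev/null 2>&1 && command -v systemctl >/dev/null 2>&1; then
  consul kv get -recurse consul-alerts/reminders/   # before: stale keys for several nodes
  sudo systemctl restart consul-alerts.service       # restart on this node only
  sleep 10                                           # allow the restarted instance time to sweep
  consul kv get -recurse consul-alerts/reminders/   # after: keys for other nodes are gone too
  RESULT="checked"
else
  RESULT="consul or systemctl not available"
fi
echo "$RESULT"
```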
Hm, even after the restart that sent the clearing Slack notification, my log file at `/var/log/consul-alerts/consul-notifications.log` still only has the lines

```
[consul-notifier] 2025/07/06 23:32:09 Node=my-node-4 Service=myfs Check=Service 'myfs' check Status=warning
[consul-notifier] 2025/07/06 23:32:09 Node=my-node-5 Service=myfs Check=Service 'myfs' check Status=warning
```
These are old lines (from before the restart); it is surprising that the restart did not create any new log lines, in particular none with `Status=passing`.
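The missing-log observation can be checked mechanically. The snippet below reproduces the two log lines in a temp file (so it is self-contained; the real file is `/var/log/consul-alerts/consul-notifications.log`) and counts notifications per status; after a recovery one would expect at least one `Status=passing` line:

```shell
# Write the log excerpt from this report to a temp file so the check
# below is self-contained.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[consul-notifier] 2025/07/06 23:32:09 Node=my-node-4 Service=myfs Check=Service 'myfs' check Status=warning
[consul-notifier] 2025/07/06 23:32:09 Node=my-node-5 Service=myfs Check=Service 'myfs' check Status=warning
EOF
# Count notifications by status; zero "passing" lines means consul-alerts
# never logged the recovery.
PASSING=$(grep -c 'Status=passing' "$LOG" || true)
WARNING=$(grep -c 'Status=warning' "$LOG" || true)
echo "passing=$PASSING warning=$WARNING"
rm -f "$LOG"
```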