(Stale) Reminder doesn't get deleted
During the past weeks, we observed a stale reminder that did not get deleted even though the service was already back to passing/healthy.
That means critical notifications keep being sent even after a service has recovered and is healthy again:
```
consul-alerts@xxx-europe-xxx
System is CRITICAL The following nodes are currently experiencing issues: Failed: 1 Warning: 0 Passed: 0
Node: xxxx Servicename: Servicename Since: 2020-01-26 07:21:06.216467314 +0100 CET Output:
HTTP GET http://localhost:xxxx/health: 200 OK Output: healthy
```
The KV entry under `consul-alerts/reminders/$hostname$` still exists when this happens. Once the entry is deleted manually, the erroneously sent notifications stop.
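For reference, the manual workaround can be sketched as follows. The node/service names are taken from the example further down in this report; the `command -v` guard is only there so the sketch degrades gracefully on a machine without the consul CLI:

```shell
# Sketch of the manual workaround: inspect the reminder keys, then delete
# the stale one so the repeated critical notifications stop.
if command -v consul >/dev/null 2>&1; then
  # List all reminder entries currently stored in the KV store.
  consul kv get -recurse consul-alerts/reminders/
  # Delete the stale reminder for the affected node/service
  # (names from the example in this report).
  consul kv delete consul-alerts/reminders/my-node-4/myfs
  STATUS="deleted"
else
  STATUS="consul CLI not available"
fi
echo "$STATUS"
```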
I can confirm this is still the case.
My outputs also carry a timestamp, from which I can see that they are more than 20 days old; consul-alerts has not fetched the new state from the Consul service since then.
I also used `consul kv get -recurse consul-alerts/reminders` and `consul kv get -recurse consul-alerts/` to check it.
I found that restarting consul-alerts on one of the nodes clears all the old reminders, including those that were saved for completely different nodes.
For example, `consul kv get -recurse consul-alerts/reminders` returns:

```
consul-alerts/reminders/my-node-4/myfs:{"Node":"my-node-4","ServiceId":"myfs","Service":"myfs","CheckId":"myfs","Check":"Service 'myfs' check","Status":"warning","Output":"HEALTH_WARN 1/3 mons down, quorum my-node-4,my-node-5; 1 datacenter (12 osds) down; 12 osds down; 1 host (12 osds) down; Degraded data redundancy: 38197594/340437540 objects degraded (11.220%), 348 pgs degraded, 348 pgs undersized","Notes":"","Interval":120,"RmdCheck":"2025-07-06T23:32:09.301785831Z","NotifList":{"log":true,"slack":true},"VarOverrides":{"email":null,"log":null,"influxdb":null,"slack":null,"mattermost":null,"mattermost-webhook":null,"pagerduty":null,"hipchat":null,"opsgenie":null,"awssns":null,"victorops":null,"http-endpoint":null,"ilert":null,"custom":null},"Timestamp":"2025-05-17T21:30:23.887209684Z"}
consul-alerts/reminders/my-node-5/myfs:{"Node":"my-node-5","ServiceId":"myfs","Service":"myfs","CheckId":"myfs","Check":"Service 'myfs' check","Status":"warning","Output":"HEALTH_WARN 1/3 mons down, quorum my-node-4,my-node-5; 1 datacenter (12 osds) down; 12 osds down; 1 host (12 osds) down; Degraded data redundancy: 38197594/340437540 objects degraded (11.220%), 348 pgs degraded, 348 pgs undersized","Notes":"","Interval":120,"RmdCheck":"2025-07-06T23:32:09.303745681Z","NotifList":{"log":true,"slack":true},"VarOverrides":{"email":null,"log":null,"influxdb":null,"slack":null,"mattermost":null,"mattermost-webhook":null,"pagerduty":null,"hipchat":null,"opsgenie":null,"awssns":null,"victorops":null,"http-endpoint":null,"ilert":null,"custom":null},"Timestamp":"2025-05-17T21:30:23.889748437Z"}
```
`systemctl restart consul-alerts.service` on just my-node-4 also cleared the key `consul-alerts/reminders/my-node-5/myfs` after a few seconds' delay.
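The steps of that observation can be sketched as below. This is only a reproduction sketch, not a fix; it assumes the systemd unit name `consul-alerts.service` from this report, and the guard is there so the sketch is self-contained on machines without consul/systemctl:

```shell
# Reproduce the observation: restarting consul-alerts on ONE node clears
# reminder keys for OTHER nodes as well.
if command -v consul >/dev/null 2>&1 && command -v systemctl >/dev/null 2>&1; then
  consul kv get -recurse consul-alerts/reminders/   # before: stale keys for several nodes
  sudo systemctl restart consul-alerts.service       # restart on this node only
  sleep 10                                           # allow the restarted instance time to sweep
  consul kv get -recurse consul-alerts/reminders/   # after: keys for other nodes are gone too
  RESULT="checked"
else
  RESULT="consul or systemctl not available"
fi
echo "$RESULT"
```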
Hm, even after the restart that sent the clearing Slack notification, my log file at `/var/log/consul-alerts/consul-notifications.log` still only has the lines

```
[consul-notifier] 2025/07/06 23:32:09 Node=my-node-4 Service=myfs Check=Service 'myfs' check Status=warning
[consul-notifier] 2025/07/06 23:32:09 Node=my-node-5 Service=myfs Check=Service 'myfs' check Status=warning
```
These are old lines (from before the restart); it is surprising that the restart did not create any new log lines, in particular none with `Status=passing`.
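The missing-log observation can be checked mechanically. The snippet below reproduces the two log lines in a temp file (so it is self-contained; the real file is `/var/log/consul-alerts/consul-notifications.log`) and counts notifications per status; after a recovery one would expect at least one `Status=passing` line:

```shell
# Write the log excerpt from this report to a temp file so the check
# below is self-contained.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[consul-notifier] 2025/07/06 23:32:09 Node=my-node-4 Service=myfs Check=Service 'myfs' check Status=warning
[consul-notifier] 2025/07/06 23:32:09 Node=my-node-5 Service=myfs Check=Service 'myfs' check Status=warning
EOF
# Count notifications by status; zero "passing" lines means consul-alerts
# never logged the recovery.
PASSING=$(grep -c 'Status=passing' "$LOG" || true)
WARNING=$(grep -c 'Status=warning' "$LOG" || true)
echo "passing=$PASSING warning=$WARNING"
rm -f "$LOG"
```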