consul-alerts icon indicating copy to clipboard operation
consul-alerts copied to clipboard

consul-alerts stops working after some time

Open nh2 opened this issue 6 years ago • 1 comments

Background: I've been running consul-alerts for years in a cluster of 3 consul servers.

The last restart I've done was ~4 months ago.

Today external website monitoring alerted me that one of my 3 machines was down (the reason being that the underlying VPN excluded it), including 1 of the 3 consul servers. I was surprised that external monitoring told me that, and consul-alerts did not, as that is its main job.

I checked the configured consul-alerts/config/notifiers/log/path on all 3 machines, and the last outputs as recorded in those log files were from 7 days ago, 6 days ago, and 3 days ago respectively.

System logs on all 3 machines:

[root@node-1:~]# systemctl status consul-alerts | cat
● consul-alerts.service
   Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-11-03 02:23:58 UTC; 4 months 26 days ago
 Main PID: 1703 (consul-alerts)
    Tasks: 83 (limit: 4915)
   CGroup: /system.slice/consul-alerts.service
           ├─ 1703 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
           ├─ 1769 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn
           ├─ 1770 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
           └─17052 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
[root@node-2:~]# systemctl status consul-alerts | cat
● consul-alerts.service
   Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2020-01-17 17:50:03 UTC; 2 months 12 days ago
 Main PID: 21793 (consul-alerts)
    Tasks: 76 (limit: 4915)
   CGroup: /system.slice/consul-alerts.service
           ├─21793 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
           ├─21806 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
           └─21807 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn

Mar 30 16:14:05 node-2 consul-alerts[21793]: 2020/03/30 16:14:05 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:17:12 node-2 consul-alerts[21793]: 2020/03/30 16:17:12 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:20:20 node-2 consul-alerts[21793]: 2020/03/30 16:20:20 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:23:27 node-2 consul-alerts[21793]: 2020/03/30 16:23:27 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:26:34 node-2 consul-alerts[21793]: 2020/03/30 16:26:34 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:29:41 node-2 consul-alerts[21793]: 2020/03/30 16:29:41 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:32:48 node-2 consul-alerts[21793]: 2020/03/30 16:32:48 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:35:56 node-2 consul-alerts[21793]: 2020/03/30 16:35:56 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:39:03 node-2 consul-alerts[21793]: 2020/03/30 16:39:03 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:42:10 node-2 consul-alerts[21793]: 2020/03/30 16:42:10 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
[root@node-3:~]# systemctl status consul-alerts | cat
● consul-alerts.service
   Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-11-03 02:24:38 UTC; 4 months 26 days ago
 Main PID: 1592 (consul-alerts)
    Tasks: 76 (limit: 4915)
   CGroup: /system.slice/consul-alerts.service
           ├─1592 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
           ├─1624 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
           └─1625 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn

Mar 19 13:27:24 node-3 consul-alerts[1592]: 2020/03/19 13:27:24 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (rpc error making call: EOF), retry in 5s
Mar 19 13:27:36 node-3 consul-alerts[1592]: 2020/03/19 13:27:36 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 20s
Mar 30 15:07:48 node-3 consul-alerts[1592]: 2020/03/30 15:07:48 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (rpc error making call: EOF), retry in 5s
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Note the Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable. is not important here -- that is what systemd prints for log rotation, but the newer (perhaps relevant) logs are still present here).

In the output we can see that node-1 logged no errors, node-2 was in a retry loop (makes sense given that it is the one that was kicked out of the VPN, so it cannot talk to the Consul server quorum), and node 3 apparently had a short error today, but retried only once because its retry did not grow beyond 5s.

If I log into the consul dashboard (localhost:8500 on either server), I can see that lots of services are marked red.

Yet, consul-alerts today sent me no notifications at all -- neither to Slack (as I have it configured), nor into the configured log file. It usually does notify my.

So the question is: Why did it fail as a watchdog today?

I would have expected it to notify me of the errors shown in the Consul UI shortly after they ocurred.

A systemctl restart consul-alerts on node-1 immediately resulted in a Slack notification, and correct entries in the log file (but only in the log file of node-3!).

So it seems that something got stuck, and only manual restarting revived it (which is of course not good for an automated high-availability watchdog).

I cannot currently tell whether the fault is in consul-alerts, or in consul itself (consul-alerts starts consul watch, so in case that one hung and didn't output anything, it would be consul's fault).

Maybe anybody knows anything?

nh2 avatar Mar 30 '20 17:03 nh2

Similar situation the other way around, where an alert did not get cleared for 20 days after it recovered:

  • https://github.com/AcalephStorage/consul-alerts/issues/275#issuecomment-3043174024

nh2 avatar Jul 07 '25 00:07 nh2