consul-alerts

No notification is triggered if unable to retrieve health check status

Open vidarh opened this issue 9 years ago • 2 comments

If the Consul connection fails or consul-alerts is otherwise unable to retrieve the health check status, it does two things that IMHO are broken:

  • It doesn't appear to trigger a notification.
  • It backs off between retries, so if the cluster comes back online, the longer it has been offline, the longer it takes before the status change is noticed.

I'm not sure yet if this is down to "consul watch" or "consul-alerts watch checks"

vidarh, Jan 30 '15 16:01

Hmm.. yeah. Maybe we need to add a notification when consul-alerts can't access Consul? consul-alerts retries the connection to Consul and uses a backoff so it doesn't retry all the time. This needs some revision, since it might wait too long even though Consul is already back up.

darkcrux, May 18 '15 08:05

I think the backoff in principle is fine, but there ought to be a (relatively low) ceiling on it so it doesn't keep increasing - the benefit of a backoff is primarily to prevent it from over-loading an already degraded system, but if e.g. one check a minute is overloading it, it's already too far gone to make much difference in my opinion.

Also, I for one consider failure to get health checks one of the most critical things to get alerts for in a monitoring system: if Consul fails, I'm "flying blind". While in theory another consul-alerts instance would take over (and so you can sort-of work around it by having suitable health checks for the other Consul agents defined on each Consul agent), this is one thing I'm paranoid about...

E.g. consider a case where the problem affects all Consul nodes equally in quick succession (e.g. a broken update triggers the same failure in all of them). Might sound far-fetched, but I've had it happen.

vidarh, May 18 '15 09:05