
feat(endpoint): shorten interval on error

Open vax-r opened this issue 1 year ago • 1 comments

Summary

This PR lets users set a shorter interval for when an endpoint's status check fails. Note that you have to add a field called failed-interval to the endpoints in config.yaml.

fix #112

@TwiN, I haven't written the documentation yet. After your review, I'll document this feature in README.md and update the example in config.yaml.

Checklist

  • [X] Tested and/or added tests to validate that the changes work as intended, if applicable.
  • [ ] Updated documentation in README.md, if applicable.

vax-r avatar Aug 18 '23 02:08 vax-r

So there's a reason why I haven't implemented this sooner. My biggest concern is that, if we have a retry interval (or failed interval, as you named it), its interaction with the alert configuration is no longer obvious, if that makes sense.

For instance, let's say you set the normal interval to 3m, and the retry interval to 5 seconds. This would imply that once every 3 minutes, the endpoint's health is evaluated, and if it fails, it will retry 5 seconds later. If that fails again, then it would presumably stop retrying and the next check would be in 3 minutes.

The problems with the statement above are the following:

  • I made an assumption that there would only be one retry. This isn't exactly clear, and that is reinforced by the fact that your implementation seems to retry forever until it succeeds (but what if it doesn't? The increase in load would likely not help the situation, especially if the underlying endpoint being retried upon performs intensive tasks behind the scenes)
  • What about the alerts?
    • Does the failure-threshold include the retries?
      • One could assume that it should, as it may allow us to trigger an alert sooner which users could then fix sooner, but what about the success-threshold then? Would it make sense for alerts to be resolved faster too?
  • How should this be displayed in the UI? Should successful retries show the failed retries? If multiple retries fail, wouldn't that pollute the UI by showing multiple failures in a row when these failures in fact happened within the shorter time frame defined by the retry interval?

What I'm trying to get at is, I'm really not sure I want to add this because it makes the configuration less intuitive.

TwiN avatar Sep 02 '23 19:09 TwiN

As there has been no reply for several months, I'm going to close this.

TwiN avatar Feb 08 '24 01:02 TwiN