Smart retries with backoff
Problem
Checks are retried very quickly after failing once. For checks that run relatively infrequent, e.g. once every 10 minutes at longer, this can create a longer down time report, because it will take at least 10 minutes for the next check interval.
This creates skewed metrics on how long a particular endpoint or site was down.
Possible solution
We should make our retries smarter by retrying checks with a back-off. E.g.
- 10 minute interval check fails, do instant retry.
- check again after 1 minute.
- check again after 2 minutes.
- check again after 5 minutes.
Considerations
- We need to take into account the current frequency at which the check runs.
- We need to take into account the region, so a check failing in
us-east-1should be retried onus-east-1. Possibly this can be togglable by the user.
Stretch goal
Saving and showing the retried requests in the UI and in the API will help triaging any failed requests. We will need to exempt retries from the availability and performance metrics.
referencing https://github.com/checkly/public-roadmap/issues/177
@tnolet Any chance you can leave this up for the user to decide?, it would be great if we could choose among back-off strategies like fibonacci, exponential, linear, etc...
Also if right now it's retried immediately, this is still useful for detecting flapping resources.
@Coolomina that is an interesting suggestion, thanks