
Smart retries with backoff

Open tnolet opened this issue 3 years ago • 3 comments

Problem

Checks are retried very quickly after failing once. For checks that run relatively infrequently, e.g. once every 10 minutes or longer, this can inflate the reported downtime: if the instant retry also fails, the next check only runs at the next interval, at least 10 minutes later.

This creates skewed metrics on how long a particular endpoint or site was down.
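
To make the skew concrete, here is a rough sketch. It assumes (this is not stated above) that reported downtime is the time from the first failure until the first check that runs after the outage has ended. The function name and the 30-second outage are made up for illustration.

```python
def reported_downtime(outage_min, check_times_min):
    """Minutes from the first failure (t=0) to the first check that passes.

    Assumption: a check passes if it runs at or after the moment the
    outage ended. Returns None if no listed check runs late enough.
    """
    for t in check_times_min:
        if t >= outage_min:
            return t
    return None


# Hypothetical 30-second outage on a 10-minute check.
# Instant retry only: attempts at t=0 (retry) and t=10 (next interval).
instant_only = reported_downtime(0.5, [0, 10])

# With extra retries at 1, 3 and 8 minutes after the first failure.
with_backoff = reported_downtime(0.5, [0, 1, 3, 8, 10])
```

Under these assumptions the same 30-second blip is reported as 10 minutes of downtime with an instant retry only, and as 1 minute with the extra retries.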

Possible solution

We should make our retries smarter by retrying checks with a back-off. E.g.

  1. A check with a 10-minute interval fails; retry instantly.
  2. Check again after 1 minute.
  3. Check again after 2 minutes.
  4. Check again after 5 minutes.
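
The steps above can be sketched as a simple schedule. The list does not say whether each delay counts from the first failure or from the previous attempt; this sketch assumes the latter. The names are hypothetical.

```python
from itertools import accumulate

# Delays in minutes between attempts, taken from the list above.
# Assumption: each delay is measured from the previous attempt.
BACKOFF_DELAYS_MIN = [0, 1, 2, 5]


def retry_offsets(delays_min):
    """Return, for each retry, the minutes elapsed since the first failure."""
    return list(accumulate(delays_min))


offsets = retry_offsets(BACKOFF_DELAYS_MIN)
```

With that reading, the retries land 0, 1, 3 and 8 minutes after the first failure, so all of them fit before the next regular run of a 10-minute check.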

Considerations

  • We need to take into account the current frequency at which the check runs.
  • We need to take into account the region, so a check failing in us-east-1 should be retried in us-east-1. Possibly the user could toggle this behavior.
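
One possible way to combine both considerations, as a sketch only: drop any retry that would land at or after the next scheduled run, and keep the retry in the failing region unless the user overrides it. `RetryPlan`, `plan_retries` and the `retry_region` override are hypothetical names, not an existing API.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import List, Optional


@dataclass
class RetryPlan:
    region: str             # region the retries run in
    offsets_min: List[int]  # minutes after the first failure


def plan_retries(interval_min: int,
                 failed_region: str,
                 delays_min: List[int],
                 retry_region: Optional[str] = None) -> RetryPlan:
    """Build a retry plan that respects check frequency and region.

    Assumptions: delays are measured from the previous attempt, and a
    retry is pointless once the next scheduled run is due.
    """
    # Default to the region where the check failed; allow a user override.
    region = retry_region or failed_region
    # Keep only retries that happen before the next regular run.
    offsets = [t for t in accumulate(delays_min) if t < interval_min]
    return RetryPlan(region=region, offsets_min=offsets)


plan = plan_retries(10, "us-east-1", [0, 1, 2, 5])
short_plan = plan_retries(5, "us-east-1", [0, 1, 2, 5])
```

For a 5-minute check the last retry (8 minutes after the failure) is dropped, since the regular run would happen first.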

Stretch goal

Saving and showing the retried requests in the UI and in the API will help with triaging failed requests. We will need to exclude retries from the availability and performance metrics.

tnolet avatar Apr 26 '22 10:04 tnolet

referencing https://github.com/checkly/public-roadmap/issues/177

tnolet avatar Jul 07 '22 15:07 tnolet

@tnolet Any chance you can leave this up to the user to decide? It would be great if we could choose among back-off strategies like Fibonacci, exponential, linear, etc.

Also, the immediate retry that happens right now is still useful for detecting flapping resources.
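
A rough sketch of what selectable strategies could look like. The function names, the `STRATEGIES` mapping and the 1-minute base delay are all hypothetical, just to illustrate the idea.

```python
def linear(base_min, retries):
    """Delays grow by a fixed step: base, 2*base, 3*base, ..."""
    return [base_min * i for i in range(1, retries + 1)]


def exponential(base_min, retries):
    """Delays double each time: base, 2*base, 4*base, ..."""
    return [base_min * 2 ** i for i in range(retries)]


def fibonacci(base_min, retries):
    """Delays follow the Fibonacci sequence: base * (1, 1, 2, 3, 5, ...)."""
    delays, a, b = [], 1, 1
    for _ in range(retries):
        delays.append(base_min * a)
        a, b = b, a + b
    return delays


# User-facing choice, keyed by strategy name.
STRATEGIES = {
    "linear": linear,
    "exponential": exponential,
    "fibonacci": fibonacci,
}


def delays_for(strategy, base_min=1, retries=4):
    """Return the delays (in minutes) between attempts for a strategy."""
    return STRATEGIES[strategy](base_min, retries)
```

An immediate first retry could be kept in front of any of these by prepending a 0-minute delay.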

Coolomina avatar Sep 22 '22 12:09 Coolomina

@Coolomina that is an interesting suggestion, thanks

tnolet avatar Sep 26 '22 15:09 tnolet