mimir icon indicating copy to clipboard operation
mimir copied to clipboard

User circuit breaker remaining delay to calculate the Retry-After delay

Open duricanikolic opened this issue 1 year ago • 4 comments

What this PR does

This PR changes the current procedure for calculating the Retry-After delay by taking into account the remaining circuit breaker delay. Currently the Retry-After delay is calculated this way:

  • we select a random value r between baseSeconds * 2^(retryAttempt - 1) (inclusive) and baseSeconds * 2^retryAttempt) (exclusive),
  • we return the minimium between r and baseSeconds * 2^maxBackoffExponent, because the delay must never be higher than the latter,

where retryAttempt is the current retry attempt present in the request's header, while baseSeconds and maxBackoffExponent are configured values of CLI flags -distributor.retry-after-header.base-seconds and distributor.retry-after-header.max-backoff-exponent.

This PR introduces a possibility to force the Retry-After calculation to wait at least until an open circuit breaker retries, i.e.,

  • we select a random value r between baseSeconds * 2^(retryAttempt - 1) (inclusive) and baseSeconds * 2^retryAttempt) (exclusive),
  • we choose the maximum between r and circuitBreakerRemainingDelay
  • if the value calculated at the previous step is higher than the maximum allowed value baseSeconds * 2^maxBackoffExponent, we return the latter,

where circuitBreakerRemainingDelay is the remaining time in seconds an open circuit breaker will remain in the open state before it retries.

In order to allow the new way of calculation, the newly added CLI flag -distributor.retry-after-header.circuit-breaker-awareness-enabled must be set to true.

Checklist

  • [x] Tests updated.
  • [x] Documentation added.
  • [x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • [ ] about-versioning.md updated with experimental features.

duricanikolic avatar Feb 09 '24 12:02 duricanikolic

Would this cause a thundering herd effect for all requests that hit the same distributor replica - they'd all retry at the same time?

Maybe @ying-jeanne can pitch in since she did some experiments with retry behavior

dimitarvdimitrov avatar Feb 12 '24 07:02 dimitarvdimitrov

@dimitarvdimitrov is right, the current behaviour is not really as described. retry-after is guaranteed always random to prevent the problem described above ⬆️ . When the retry attemps <= max retry attemps: retryafter = random[2^(retry-1), 2^retry] And when the retry attemps > max retry attemps: retryafter = random[2^(max_retry-1), 2^max_retry]

In your case, how circuitBreakerRemainingDelay is calculated or decided to set? @duricanikolic

ying-jeanne avatar Feb 12 '24 09:02 ying-jeanne

If circuitBreakerRemainingDelay itself doesn't have any randomness, I am thinking: after minSeconds and maxSeconds calculated with the previous logic, just add in the end of the function

if circuitBreakerRemainingDelay > minSeconds && circuitBreakerRemainingDelay < maxSeconds {
    retry = random[circuitBreakerRemainingDelay, maxSeconds]
} else {
    retry = random[minSeconds, maxSeconds]
}

Since when circuitBreakerRemainingDelay is even smaller than the minSeconds, no need to consider it. When it is even higher than maxSeconds, we don't want to consider it, since it might goes even more than the timeout. The only case we consider it would be when it falls in between, in this case we send a random number between circuitBreakerRemainingDelay and maxSeconds. Maybe this could a solution for the problem you are solving now.

ying-jeanne avatar Feb 12 '24 13:02 ying-jeanne

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase main and eventually move the CHANGELOG entry added / updated in this PR to the top of the CHANGELOG document. Thanks!

duricanikolic avatar Mar 12 '24 09:03 duricanikolic