mimir
mimir copied to clipboard
User circuit breaker remaining delay to calculate the Retry-After delay
What this PR does
This PR changes the current procedure for calculating the Retry-After
delay by taking into account the remaining circuit breaker delay. Currently the Retry-After
delay is calculated this way:
- we select a random value
r
betweenbaseSeconds * 2^(retryAttempt - 1)
(inclusive) andbaseSeconds * 2^retryAttempt)
(exclusive), - we return the minimium between
r
andbaseSeconds * 2^maxBackoffExponent
, because the delay must never be higher than the latter,
where retryAttempt
is the current retry attempt present in the request's header, while baseSeconds
and maxBackoffExponent
are configured values of CLI flags -distributor.retry-after-header.base-seconds
and distributor.retry-after-header.max-backoff-exponent
.
This PR introduces a possibility to force the Retry-After
calculation to wait at least until an open circuit breaker retries, i.e.,
- we select a random value
r
betweenbaseSeconds * 2^(retryAttempt - 1)
(inclusive) andbaseSeconds * 2^retryAttempt)
(exclusive), - we choose the maximum between
r
andcircuitBreakerRemainingDelay
- if the value calculated at the previous step is higher than the maximum allowed value
baseSeconds * 2^maxBackoffExponent
, we return the latter,
where circuitBreakerRemainingDelay
is the remaining time in seconds an open circuit breaker will remain in the open state before it retries.
In order to allow the new way of calculation, the newly added CLI flag -distributor.retry-after-header.circuit-breaker-awareness-enabled
must be set to true
.
Checklist
- [x] Tests updated.
- [x] Documentation added.
- [x]
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
. - [ ]
about-versioning.md
updated with experimental features.
Would this cause a thundering herd effect for all requests that hit the same distributor replica - they'd all retry at the same time?
Maybe @ying-jeanne can pitch in since she did some experiments with retry behavior
@dimitarvdimitrov is right, the current behaviour is not really as described. retry-after is guaranteed always random to prevent the problem described above ⬆️ . When the retry attemps <= max retry attemps: retryafter = random[2^(retry-1), 2^retry] And when the retry attemps > max retry attemps: retryafter = random[2^(max_retry-1), 2^max_retry]
In your case, how circuitBreakerRemainingDelay is calculated or decided to set? @duricanikolic
If circuitBreakerRemainingDelay itself doesn't have any randomness, I am thinking: after minSeconds and maxSeconds calculated with the previous logic, just add in the end of the function
if circuitBreakerRemainingDelay > minSeconds && circuitBreakerRemainingDelay < maxSeconds {
retry = random[circuitBreakerRemainingDelay, maxSeconds]
} else {
retry = random[minSeconds, maxSeconds]
}
Since when circuitBreakerRemainingDelay is even smaller than the minSeconds, no need to consider it. When it is even higher than maxSeconds, we don't want to consider it, since it might goes even more than the timeout. The only case we consider it would be when it falls in between, in this case we send a random number between circuitBreakerRemainingDelay and maxSeconds. Maybe this could a solution for the problem you are solving now.
The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase main
and eventually move the CHANGELOG entry added / updated in this PR to the top of the CHANGELOG document. Thanks!