semian
Feature Request: Throttle half_open -> closed attempts
What
Currently, when the error_timeout expires, the next acquisition request for a circuit causes a transition from open to half_open. In this state, workers attempt to access the resource with a modified timeout of half_open_resource_timeout. The motivation is that the modified timeout is much lower than the client timeout, so if the resource is still unhealthy, requests fail faster.
In the current implementation, every available worker (subject to the bulkhead configuration) will attempt the half_open -> closed transition. This means that if the resource is still unhealthy, all the workers could potentially block for half_open_resource_timeout seconds, reducing overall node capacity.
Mathematically, this means that a fraction t[half-open] / (t[half-open] + t[error_timeout]) of our capacity will be spent attempting to re-close the circuit. If t[half-open] is 1.0s and t[error_timeout] is 5.0s (our MySQL defaults), then 1.0 / (1.0 + 5.0) = 16.7% of our capacity will go toward re-closing the circuit. If bulkheads are in place with a quota of 0.5, that number will be 8.3%.
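The arithmetic above can be checked with a few lines of Ruby. This is only a sketch: probe_capacity_fraction is a hypothetical helper written for this issue, not part of Semian, and it assumes the bulkhead quota simply scales the fraction linearly.

```ruby
# Hypothetical helper (not part of Semian): fraction of worker capacity
# spent probing a resource that is still unhealthy.
# Both timeouts are in seconds; quota is the bulkhead quota, where 1.0
# means every worker may attempt the half_open -> closed transition.
def probe_capacity_fraction(half_open_timeout, error_timeout, quota: 1.0)
  quota * half_open_timeout / (half_open_timeout + error_timeout)
end

# The MySQL defaults quoted above: 1.0s half-open timeout, 5.0s error_timeout.
puts format("%.1f%%", 100 * probe_capacity_fraction(1.0, 5.0))             # 16.7%
puts format("%.1f%%", 100 * probe_capacity_fraction(1.0, 5.0, quota: 0.5)) # 8.3%
```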
How
When a circuit opens, the number of available tickets should immediately drop to 1. This shields the rest of the workers from the unhealthy resource. Failing at bulkhead acquisition is marginally faster than raising the open circuit error, since bulkhead acquisition is attempted before circuit-breaker acquisition, but the difference is likely not a big deal.
When the transition happens from open to half_open, we can raise the number of available tickets to success_threshold, to allow parallel re-closing of the circuit. Once the circuit is finally re-closed, we can raise the number of available tickets back to the original tickets/quota value.
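To make the proposed ticket schedule concrete, here is a minimal, self-contained model of it. The ThrottledTickets class and its on_transition method are hypothetical names used for illustration only; they are not Semian's API and say nothing about how the change would actually be wired into the bulkhead. Capping the half_open ticket count at the configured value is an added assumption, so that throttling never grants more tickets than the original configuration.

```ruby
# Hypothetical model of the proposal; none of these names are Semian's API.
class ThrottledTickets
  attr_reader :available

  def initialize(configured_tickets:, success_threshold:)
    @configured_tickets = configured_tickets
    @success_threshold = success_threshold
    @available = configured_tickets
  end

  # Adjust the ticket count for a circuit state of :open, :half_open,
  # or :closed, and return the new count.
  def on_transition(state)
    @available =
      case state
      when :open
        1 # shield all but one worker from the unhealthy resource
      when :half_open
        # allow parallel re-closing; the cap is an assumption of this sketch
        [@success_threshold, @configured_tickets].min
      when :closed
        @configured_tickets # restore the original tickets/quota value
      else
        raise ArgumentError, "unknown circuit state: #{state.inspect}"
      end
  end
end

tickets = ThrottledTickets.new(configured_tickets: 10, success_threshold: 2)
[:open, :half_open, :closed].each do |state|
  puts "#{state}: #{tickets.on_transition(state)} ticket(s)"
end
```

With 10 configured tickets and a success_threshold of 2, the model steps through 1, 2, and 10 available tickets as the circuit moves through open, half_open, and closed.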