
Feature Request: Throttle half_open -> closed attempts

Open michaelkipper opened this issue 5 years ago • 0 comments

What

Currently, when the error_timeout expires, the next acquisition request for a circuit will cause a transition from open to half_open. In this state, workers will attempt to access the resource with a modified timeout of half_open_resource_timeout. The motivation here is that the modified timeout is much lower than the client timeout so if the resource is still unhealthy, it will fail fast(er).

In the current implementation, every available worker (subject to the bulkhead configuration) will attempt the half_open -> closed transition. This means that if the resource is still unhealthy, all the workers could potentially block for half_open_resource_timeout seconds, reducing overall node capacity.

Mathematically, this means that a fraction t[half_open] / (t[half_open] + t[error_timeout]) of worker capacity will be spent attempting to re-close the circuit, where t[half_open] is half_open_resource_timeout. If t[half_open] is 1.0s and t[error_timeout] is 5.0s (our MySQL defaults), then 16.7% of our capacity will go toward re-closing the circuit. If bulkheads are in place with a quota of 0.5, that number will be 8.3%.
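As a quick check of the arithmetic above, here is a minimal sketch in Ruby. The method name probe_capacity_fraction and its keyword arguments are made up for illustration and are not part of Semian; the quota is treated as a simple multiplier, as the numbers above imply.

```ruby
# Hypothetical helper, not part of Semian: estimates the fraction of worker
# capacity spent probing a resource that is still unhealthy.
def probe_capacity_fraction(half_open_resource_timeout:, error_timeout:, quota: 1.0)
  # Each cycle is error_timeout seconds with the circuit open, followed by
  # up to half_open_resource_timeout seconds blocked on the probe.
  cycle = half_open_resource_timeout + error_timeout
  quota * half_open_resource_timeout / cycle
end

no_bulkhead = probe_capacity_fraction(half_open_resource_timeout: 1.0, error_timeout: 5.0)
with_quota  = probe_capacity_fraction(half_open_resource_timeout: 1.0, error_timeout: 5.0, quota: 0.5)

puts (no_bulkhead * 100).round(1) # 16.7
puts (with_quota * 100).round(1)  # 8.3
```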

How

When a circuit opens, the number of available tickets should immediately drop to 1. This shields the rest of the workers from this unhealthy resource. Failing at the bulkhead is also marginally faster than raising the open circuit error, since bulkhead acquisition is attempted before circuit-breaker acquisition, but that's likely not a big deal.

When the transition happens from open to half_open, we can raise the number of available tickets to success_threshold, to allow parallel re-closing of the circuit. Once the circuit is finally re-closed, we can raise the number of available tickets back to the original tickets/quota value.
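The proposed ticket adjustments can be sketched roughly as follows. This is only an illustration of the idea, not Semian's implementation: the ThrottledTickets class, its on_transition method, and the configured_tickets name are hypothetical, and capping the half_open count at the configured value is an added assumption.

```ruby
# Hypothetical sketch of the proposal; none of these names exist in Semian.
class ThrottledTickets
  attr_reader :available

  def initialize(configured_tickets:, success_threshold:)
    @configured_tickets = configured_tickets # original tickets/quota value
    @success_threshold = success_threshold
    @available = configured_tickets
  end

  # Adjust the number of available tickets when the circuit changes state.
  def on_transition(state)
    @available =
      case state
      when :open
        1 # shield all but one worker from the unhealthy resource
      when :half_open
        # allow parallel probes; assumption: never exceed the configured value
        [@success_threshold, @configured_tickets].min
      when :closed
        @configured_tickets # restore full capacity
      else
        raise ArgumentError, "unknown circuit state: #{state.inspect}"
      end
  end
end

tickets = ThrottledTickets.new(configured_tickets: 10, success_threshold: 3)
tickets.on_transition(:open)      # 1 ticket available
tickets.on_transition(:half_open) # 3 tickets available
tickets.on_transition(:closed)    # back to 10 tickets
```

With this shape, at most success_threshold workers can block on half_open_resource_timeout at once while the circuit is half_open, instead of every worker the bulkhead would normally admit.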

michaelkipper, Jul 02 '19 14:07