semian
Feature Request: Percentage Based Error Thresholds
What
Currently, we express error thresholds as the number of failures (error_threshold) in a certain time period (error_timeout). After that threshold is reached, we open the circuit, and only close it again after a certain number of successful requests (success_threshold) are reached.
This requires intimate knowledge of your request patterns. A more flexible model is to use an error percentage threshold to determine when to open the circuit. Instead of saying 3 failures in 5 seconds, one might say over 10% of requests failed.
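To make the difference concrete, here is a minimal sketch comparing the two rules at the same 1% error rate under low and high traffic. The helper names and the error_percent_threshold keyword are hypothetical and only illustrate the idea; they are not part of Semian today.

```ruby
# Hypothetical helpers for illustration only; not Semian APIs.

# Count-based rule: trip once the number of errors in the window
# reaches a fixed threshold, regardless of how many requests were made.
def count_rule_trips?(errors, error_threshold: 3)
  errors >= error_threshold
end

# Percentage-based rule: trip once the share of failed requests in the
# window exceeds the threshold.
def percent_rule_trips?(errors, total, error_percent_threshold: 10.0)
  return false if total.zero?

  errors * 100.0 / total > error_percent_threshold
end

# The same 1% error rate observed in one window at two traffic levels.
low_traffic  = { total: 100,    errors: 1 }
high_traffic = { total: 10_000, errors: 100 }

[low_traffic, high_traffic].each do |window|
  count   = count_rule_trips?(window[:errors])
  percent = percent_rule_trips?(window[:errors], window[:total])
  puts "total=#{window[:total]} count_rule=#{count} percent_rule=#{percent}"
end
```

With these assumed numbers, the count rule trips only in the high-traffic window even though the error rate is unchanged, while the percentage rule gives the same answer in both.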
How
Either add a new parameter, error_percent_threshold, or allow error_threshold to be expressed as a percentage (e.g. "10%").
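If the second option were chosen, the existing option would need to accept both forms. A rough sketch of how that could be parsed, assuming integers keep their current meaning and strings ending in "%" are treated as percentages (parse_error_threshold is a hypothetical helper, not an existing Semian method):

```ruby
# Hypothetical helper for illustration only; not a Semian API.
# Returns [:count, Integer] for a plain failure count, or
# [:percent, Float] for a string such as "10%".
def parse_error_threshold(value)
  case value
  when Integer
    [:count, value]
  when /\A(\d+(?:\.\d+)?)%\z/
    percent = Regexp.last_match(1).to_f
    raise ArgumentError, "percentage must be at most 100" if percent > 100

    [:percent, percent]
  else
    raise ArgumentError, "invalid error_threshold: #{value.inspect}"
  end
end

p parse_error_threshold(3)
p parse_error_threshold("10%")
```

A separate error_percent_threshold parameter would avoid this kind of type overloading, at the cost of one more option to document.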
Maintain either a large sliding window of successes and errors to compute percentages, or perhaps a set of counters to reduce the overall size of the windows.
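The "set of counters" variant could be sketched roughly as below: requests are tallied into fixed-size time buckets, and buckets older than the window are dropped, so memory grows with the number of buckets rather than the number of requests. Everything here is an assumption for illustration: the PercentErrorWindow class, its parameters, and the minimum_requests guard (added so that a single failure out of one request does not count as 100%) are not part of Semian.

```ruby
# Hypothetical sketch; none of these names exist in Semian.
class PercentErrorWindow
  def initialize(window_seconds:, error_percent_threshold:,
                 bucket_seconds: 1, minimum_requests: 1)
    @window_seconds = window_seconds
    @bucket_seconds = bucket_seconds
    @error_percent_threshold = error_percent_threshold
    @minimum_requests = minimum_requests
    @buckets = {} # bucket index => [successes, errors]
  end

  # Record one request outcome in the bucket that covers `now`.
  def record(success, now: monotonic_now)
    evict(now)
    key = (now / @bucket_seconds).floor
    bucket = (@buckets[key] ||= [0, 0])
    bucket[success ? 0 : 1] += 1
  end

  # Percentage of failed requests across the buckets still in the window.
  def error_percent(now: monotonic_now)
    evict(now)
    successes, errors = totals
    total = successes + errors
    total.zero? ? 0.0 : errors * 100.0 / total
  end

  # True when enough requests were seen and the error percentage
  # strictly exceeds the threshold.
  def open?(now: monotonic_now)
    evict(now)
    successes, errors = totals
    (successes + errors) >= @minimum_requests &&
      error_percent(now: now) > @error_percent_threshold
  end

  private

  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  def totals
    @buckets.values.reduce([0, 0]) { |(s, e), (bs, be)| [s + bs, e + be] }
  end

  # Drop whole buckets that fall outside the window. The window is
  # therefore only accurate to within one bucket width.
  def evict(now)
    oldest = ((now - @window_seconds) / @bucket_seconds).floor
    @buckets.delete_if { |key, _| key <= oldest }
  end
end

window = PercentErrorWindow.new(window_seconds: 5, error_percent_threshold: 10.0)
9.times { window.record(true, now: 0.0) }
2.times { window.record(false, now: 1.2) }
puts window.open?(now: 1.5)
```

Smaller buckets give a more precise window at the cost of more counters. A real implementation would also need to decide how this interacts with success_threshold and the half-open state, which this sketch leaves out.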
> This requires intimate knowledge of your request patterns
On top of that, it also assumes a happy path. I like a percentage threshold because it could adapt to a shift in traffic that wasn't anticipated (e.g., flash sale). This would work well with our current setup where bulkheads trip circuits, because bulkhead timeouts are more likely during high RPS, which in turn makes it more likely that we'd open circuits in high traffic situations (which may not be desirable).