semian
semian copied to clipboard
Error rate circuit breaker
This PR adds a circuit breaker that opens when the error percentage crosses some threshold (ex, 25% of calls failed in the last 10 seconds), also discussed here: https://github.com/Shopify/semian/issues/245. Ideally trying to solve for the problem mentioned in this comment: https://github.com/Shopify/semian/issues/245#issuecomment-508565807 . We have unpredictable spikes in traffic, which make tuning an absolute # of errors in a time window difficult.
There are some methods that technically violate the DRY principle, in particular the acquire
and maybe_with_half_open_resource_timeout
methods on lib/semian/error_rate_circuit_breaker.rb and lib/semian/circuit_breaker.rb are the same. I'm open to suggestions on if this should be fixed, and the best way to do so.
This has not been battle-tested in a high volume production environment yet, and I'm still working on documentation. Also working on using Timecop in the new unit tests.
I really like the concept of this PR at a high level. It would be amazing to have a more intuitive way to configure the circuit breaker.
However, I don't think that this change has actually makes it more intuitive (see comments below). I'm open to alternative approaches or suggestions. Please let me know what you think.
I really like the concept of this PR at a high level. It would be amazing to have a more intuitive way to configure the circuit breaker.
However, I don't think that this change has actually makes it more intuitive (see comments below). I'm open to alternative approaches or suggestions. Please let me know what you think.
I was talking about this with a colleague, and we both agreed it would be better to specifically specify which circuit breaker implementation you want instead of determining based on the options passed in. I plan on making that change. I'm currently working on other things at the moment, but should have time to cycle back on this in the next week.
I really like the concept of this PR at a high level. It would be amazing to have a more intuitive way to configure the circuit breaker.
However, I don't think that this change has actually makes it more intuitive (see comments below). I'm open to alternative approaches or suggestions. Please let me know what you think.
Thank you for the suggestions. I just pushed an implementation that calculates the time spent in error vs success. Please let me know your thoughts.
Hi @RyanAD any progress on this one?
We've been successfully using this within Instacart for a while now. Let me confirm we don't have any changes that should be incorporated here. I'll follow up by the end of next week.
I pushed two commits for code cleanup and a small thread safety fix in TimeSlidingWindow. This code has been running successfully within Instacart for several months. What are next steps for merging?
@RyanAD next steps:
- Add changelog line
- Rebase and squash against master branch
@miry is this still valid? What's missing, beside the conflict, to push this out?