semian icon indicating copy to clipboard operation
semian copied to clipboard

Error rate circuit breaker

Open RyanAD opened this issue 4 years ago • 7 comments

This PR adds a circuit breaker that opens when the error percentage crosses some threshold (ex, 25% of calls failed in the last 10 seconds), also discussed here: https://github.com/Shopify/semian/issues/245. Ideally trying to solve for the problem mentioned in this comment: https://github.com/Shopify/semian/issues/245#issuecomment-508565807 . We have unpredictable spikes in traffic, which make tuning an absolute # of errors in a time window difficult.

There are some methods that technically violate the DRY principle, in particular the acquire and maybe_with_half_open_resource_timeout methods on lib/semian/error_rate_circuit_breaker.rb and lib/semian/circuit_breaker.rb are the same. I'm open to suggestions on if this should be fixed, and the best way to do so.

This has not been battle-tested in a high volume production environment yet, and I'm still working on documentation. Also working on using Timecop in the new unit tests.

RyanAD avatar Apr 02 '20 01:04 RyanAD

I really like the concept of this PR at a high level. It would be amazing to have a more intuitive way to configure the circuit breaker.

However, I don't think that this change has actually makes it more intuitive (see comments below). I'm open to alternative approaches or suggestions. Please let me know what you think.

I really like the concept of this PR at a high level. It would be amazing to have a more intuitive way to configure the circuit breaker.

However, I don't think that this change has actually makes it more intuitive (see comments below). I'm open to alternative approaches or suggestions. Please let me know what you think.

I was talking about this with a colleague, and we both agreed it would be better to specifically specify which circuit breaker implementation you want instead of determining based on the options passed in. I plan on making that change. I'm currently working on other things at the moment, but should have time to cycle back on this in the next week.

RyanAD avatar Apr 09 '20 21:04 RyanAD

I really like the concept of this PR at a high level. It would be amazing to have a more intuitive way to configure the circuit breaker.

However, I don't think that this change has actually makes it more intuitive (see comments below). I'm open to alternative approaches or suggestions. Please let me know what you think.

Thank you for the suggestions. I just pushed an implementation that calculates the time spent in error vs success. Please let me know your thoughts.

RyanAD avatar Apr 24 '20 21:04 RyanAD

Hi @RyanAD any progress on this one?

miry avatar Jun 10 '22 09:06 miry

We've been successfully using this within Instacart for a while now. Let me confirm we don't have any changes that should be incorporated here. I'll follow up by the end of next week.

RyanAD avatar Jun 10 '22 22:06 RyanAD

I pushed two commits for code cleanup and a small thread safety fix in TimeSlidingWindow. This code has been running successfully within Instacart for several months. What are next steps for merging?

RyanAD avatar Jun 22 '22 22:06 RyanAD

@RyanAD next steps:

  • Add changelog line
  • Rebase and squash against master branch

miry avatar Jun 23 '22 09:06 miry

@miry is this still valid? What's missing, beside the conflict, to push this out?

nap avatar Dec 05 '22 19:12 nap