Exponential backoff

Open jacobbednarz opened this issue 7 years ago • 2 comments

Something I've been looking into lately is how we can combat the stampeding herd effect we occasionally incur once a system has recovered and is able to receive traffic again. One approach I've explored is exponential backoff, and I wanted to find out whether this is something you'd consider adding to semian. I think semian is a sensible place to put this because it already has knowledge of the tickets/quotas and error rates, and could use its already available data to decide how far to push out the backoff without needing to query another resource.
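To make the idea concrete, here is a minimal sketch of exponential backoff applied to how long a circuit stays open. It is not semian's API: `backoff_timeout`, `base_timeout`, `consecutive_trips`, `factor` and `max_timeout` are hypothetical names chosen for illustration, and it assumes the breaker tracks how many times in a row it has tripped without a successful recovery.

```python
def backoff_timeout(base_timeout, consecutive_trips, factor=2.0, max_timeout=60.0):
    """Return how long (in seconds) the circuit should stay open.

    Hypothetical helper, not part of semian. The first trip waits
    base_timeout; every further consecutive trip multiplies the wait
    by factor, capped at max_timeout.
    """
    if consecutive_trips < 1:
        raise ValueError("consecutive_trips must be at least 1")

    timeout = base_timeout
    # Multiply step by step so very large trip counts cannot overflow.
    for _ in range(consecutive_trips - 1):
        timeout *= factor
        if timeout >= max_timeout:
            return max_timeout
    return min(timeout, max_timeout)
```

The trip counter would reset once the resource is healthy again, so a later outage starts from `base_timeout` rather than from the cap.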

Also open to hearing about how you've addressed this at Shopify if you've got a good handle on it in other ways 😄

jacobbednarz, May 29 '17 23:05

Do you mean for the circuit breakers? I would assume that increasing the size of the window (timeout) would cause this to decrease since there should be some randomness involved in when the windows would open. If it's a heavily queried resource, perhaps adding some randomness for jitter to the window could work?
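A rough sketch of what jitter on the window could look like, assuming each process picks its own randomized timeout whenever its circuit opens. The names `jittered_timeout` and `jitter_ratio` are made up for illustration and are not semian options.

```python
import random


def jittered_timeout(timeout, jitter_ratio=0.2, rng=random):
    """Return timeout shifted by a random amount of up to +/- jitter_ratio.

    Hypothetical helper, not part of semian. With timeout=10 and
    jitter_ratio=0.2 the result falls between 8 and 12 seconds, so
    processes that tripped at the same moment retry at different times.
    """
    spread = timeout * jitter_ratio
    return timeout + rng.uniform(-spread, spread)
```

Passing a seeded `random.Random` instance as `rng` makes the behaviour reproducible in tests.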

By exponential backoff, are you referring to the size of the circuit breaker window, or something different? Is this a problem for your datastore due to the sudden throughput (would it ever be larger than steady state?) or the connections established per second? Which datastore are you running into trouble with?

BTW Jacob, did you roll out the new Semian with quotas for bulkheads? We've been using it in production for weeks now, and it's 👌

cc @jpittis

sirupsen, May 30 '17 00:05

Yep, for the circuit breakers. I haven't tried adding any jitter to the window but could definitely trial some ideas on that instead. The issue we have is that when we bring MySQL back online, a bunch of services are waiting on it, and with cold caches it is a bit sluggish to respond. We've got a few things in the pipeline to mitigate it, but I'm sure we'll hit it eventually with another datastore. I don't think it's getting overloaded with connections, just that everything is hitting a cold cache that needs rebuilding.

No quotas yet but it's on my list to look at in the coming weeks 😄

jacobbednarz, May 30 '17 00:05