
[DRAFT] Share circuit breakers between workers

michaelkipper opened this pull request · 5 comments

This PR fixes #27.

What

Moves the implementation of circuit breakers to shared memory so circuit errors can be shared between workers on the same host.

Why

Currently, for single-threaded workers, we need to see error_threshold consecutive failures in order to open a circuit. For applications where timeouts are necessarily high (e.g. 25 seconds for MySQL), that translates into up to 75 seconds of blocking behaviour (three consecutive 25-second timeouts) when a resource is unhealthy.

If all workers on a host experience the outage simultaneously, that information can be shared and the circuit can be tripped within a single timeout iteration, speeding up detection of the unhealthy resource and opening of the circuit.

In addition, in the case of a variable latency outage of an upstream resource, the collective hivemind can share information about the increased latency and detect trends earlier and more reliably. This feature is not implemented yet, but becomes possible after this change.

How

Note: This is very much a draft, and needs a lot of refactoring, but it's a start.

The main idea is ~~to move the data for `Simple::Integer` and `Simple::SlidingWindow` to shared memory~~ to move the data for `Simple::Integer` to a semaphore and `Simple::SlidingWindow` to shared memory.

`Simple::State` simply uses the shared `Simple::Integer` as its backing store to easily share state between workers on the same host.
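
For illustration, here is a minimal C sketch of the SysV primitive this maps onto: a single-element semaphore set used as a host-shared integer. The key derivation and names are hypothetical, not Semian's actual extension code.

```c
/* Minimal sketch (not Semian's actual code): a SysV semaphore used as a
 * host-shared integer, as this PR does for Simple::Integer. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* glibc requires the caller to define this union for semctl(). */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void) {
    /* Hypothetical key derivation; Semian derives keys from the
     * resource name. */
    key_t key = ftok("/tmp", 'S');

    /* One semaphore in the set; create it if it doesn't exist. Any
     * worker on the host that derives the same key gets the same set. */
    int semid = semget(key, 1, IPC_CREAT | 0660);
    if (semid == -1) { perror("semget"); return 1; }

    /* SETVAL/GETVAL make the semaphore behave as a shared integer. */
    union semun arg = { .val = 5 };
    semctl(semid, 0, SETVAL, arg);
    printf("shared value: %d\n", semctl(semid, 0, GETVAL));
    return 0;
}
```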

Feel free to take a look around, but I've still got some refactoring to do.

~~In particular, this implementation isn't thread-safe yet and requires some serious locking.~~

cc: @Shopify/servcomm
cc: @sirupsen, @csfrancis, @byroot

michaelkipper · Jun 12 '19, 19:06

If we have this, I'm fairly sure we should also decouple bulkheads from circuits. At the moment (or at least as of the end of 2018), a bulkhead acquisition failure marks an error in the same resource's circuit. There's a lengthy discussion in https://github.com/Shopify/shopify/pull/178393. I'll reproduce the big points for open source's sake. In the end, my conclusion was that sharing circuits between processes and decoupling bulkheads would be an attractive option, which is possible after this PR!

@ashcharles

If Semian is unable to acquire a ticket due to bulkheading, an error is thrown. Multiple errors may exceed the configured error_threshold for the resource causing the circuit to open. While this behaviour is tested in Semian, document it explicitly here with a test for the case of Redis to make it clear that repeated bulkheading on Redis may lead to open-circuit errors.

I found this behaviour surprising, as did others in the ensuing discussion. I chose code > documentation: I thought that adding a test to core would make it clear for folks who aren't necessarily looking at the Semian code-base.

@jpittis

TL;DR: I can't think of a reason (besides backward compatibility) why we wouldn't want to remove this behaviour from Semian.

This is a really good test case to articulate an important property of Semian.

Bulkheads triggering does not mean a protected service is experiencing any kind of issue. But we still open a circuit and fast-fail requests for the duration of the error timeout.

I suspect that this has caused us to fail requests to Redis (and other protected services) when the service was not under duress.

@pushrax

Thanks for bringing this up. I agree it may not be desired behaviour.

The high level goal of a circuit breaker is: "if a request is likely to fail, don't issue it." This is done for two reasons. First, and most importantly, it reduces unnecessary load on the dependency, which is particularly crucial when the dependency is failing due to overload. Second, it reduces unnecessary load on the dependent, which is crucial when the dependency is timing out and taking up too much dependent capacity.

The relevant property to notice is: a circuit breaker is not useful when the cost of a failed request is similar to the cost of evaluating the circuit breaker. When a bulkhead is full, further requests will fail without hitting the network at all, thus failure is cheap. Circuit breakers aren't providing their intended improvement when applied to bulkhead failure. I agree it makes sense to remove this behaviour; it complicates thinking about tuning. An oscillating distributed bulkhead is even more confusing to reason about than a regular distributed bulkhead.

@sirupsen

I am not sure I am sold on removing this behaviour. I've always thought the interaction between them was useful. If tickets are 50% of workers and timeouts are high, you'll be "wasting" 50% of workers on timeouts to the resource. After 3 x error_threshold you'd be back to having 100% of capacity (if the half_open_resource_timeout is reasonable).

Imagine having a 25s timeout on MySQL. When all tickets are taken, clearly that MySQL is a problem and we trigger circuits rapidly. If you waited only for the circuits, you'd have to wait 3 x 25s. If you had only bulkheads, you're "wasting" 50% of capacity. If you do both, you seem to get the best of both worlds (as long as you have the half open timeout).

@pushrax

I think this is a case where both positions are correct depending on the tuning and scenario. Simulation would be a great way to show exactly how this is true.

I keep forgetting circuits are not shared between processes, while bulkheads are. Your idea ends up working around this by sharing state between processes through the bulkheads to influence the circuits.

Vertical axis is resource usage (in tickets), horizontal is time: [figure]

Assume t0 is the resource timeout. My earlier comment about tripping the circuit not being useful here was with reference to the overlapping time between the marked t0 interval and the breaker open interval. In that case, the bulkhead will prevent further requests anyway. In the remaining time (if any) of the breaker open interval, we avoid queuing more work. t1 is showing the "half open timeout" logic.

What about the case when t0 < timeout?

[figure: resource usage over time when t0 < timeout]

When all tickets are taken, clearly that MySQL is a problem and we trigger circuits rapidly.

Downstream latency regression is one reason tickets can be exhausted, but upstream demand and small amounts of queuing can also cause this. If a large amount of load appears, filling tickets, we prematurely assume instability by shutting down the resource entirely. We fail requests that would not have failed.

When we trip circuits early based on bulkhead errors, we gain capacity for other work. If the other work can't use this capacity more effectively than the original work, we are making an error. This is the part that's difficult to understand: how likely is it that making capacity available for other work will have higher ROI given we just know a bulkhead is full? I don't have a good intuition for this.

Bulkheads can be tuned such that even when exhausted, there is sufficient capacity for other work, provided other competing bulkheads aren't concurrently exhausted. Is this not the way we use them?

Would sharing circuit state between processes be a more direct way of addressing the goals here without needing to understand the statistics of production failure modes?

pushrax · Jun 12 '19, 21:06

My big concern with this implementation is the number of shared memory segments that will be created. It looks like we're using distinct shared memory segments for each instance of Simple::Integer and Simple::SlidingWindow. For Shopify production, I could see this amounting to (tens of?) thousands of shared memory segments. I can't say for certain that this would be a problem, but it feels a bit sketchy to me and it would be good to verify that it's not.

csfrancis · Jun 13 '19, 19:06

My big concern with this implementation is the number of shared memory segments that will be created... For Shopify production, I could see this amounting to (tens of?) thousands of shared memory segments.

@csfrancis: The data I have for core suggests that it's hundreds, not thousands: https://shopify.datadoghq.com/dashboard/mzg-id4-5rp/semian-lru?screenId=mzg-id4-5rp&screenName=semian-lru&tile_size=m&from_ts=1560376147986&to_ts=1560462547986&live=true&fullscreen_widget=812985505059370&fullscreen_section=overview

Our max_size LRU cache should put bounds on this.

michaelkipper · Jun 13 '19, 21:06

@csfrancis has some legitimate concerns about the number of shared memory segments this PR would create. SHMMNI is 4,096 on Linux, so with our LRU max_size of 500 and three shared memory segments per circuit, we'd need 1,500 of them.

Scott suggested storing all the sliding windows in a single shared memory segment, but I have concerns with this approach: specifically, the complexity required to manage indexing into that data structure. Right now, we generate a key and the IPC subsystem handles the lookup.

Another difficulty is cleanup. Given that we set IPC_RMID and call shmdt when a sliding window goes out of scope, we can defer cleanup to the kernel. If we had a single memory segment, we'd have to manage all that ourselves.
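
As a rough sketch of that lifecycle (illustrative, not the actual extension code), assuming one segment per sliding window:

```c
/* Minimal sketch (not Semian's actual code) of the per-window shared
 * memory lifecycle: mark the segment for removal on teardown and let
 * the kernel destroy it after the last process detaches. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    /* One segment per sliding window; the IPC subsystem handles the
     * key -> segment lookup, so no indexing structure is needed. */
    int shmid = shmget(ftok("/tmp", 'W'), 4096, IPC_CREAT | 0660);
    if (shmid == -1) { perror("shmget"); return 1; }

    int *window = shmat(shmid, NULL, 0);
    if (window == (void *) -1) { perror("shmat"); return 1; }

    /* ... use the window while the circuit is alive ... */
    window[0] = 42;

    /* Teardown when the window goes out of scope: mark for destruction
     * and detach. The kernel frees the segment only after the last
     * attached process detaches, so cleanup is deferred to the kernel. */
    shmctl(shmid, IPC_RMID, NULL);
    shmdt(window);
    return 0;
}
```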

For SimpleInteger, a shared memory segment on Linux is at least one page (4 kB), so storing a SimpleInteger in a shared memory segment is extremely inefficient. SEMMNI is 32,000 on Linux, so storing SimpleIntegers as semaphores is a reasonable choice (as @sirupsen questioned earlier). I'll try to implement that ASAP.

michaelkipper · Jun 30 '19, 19:06

@csfrancis was concerned last week that the host-based circuits PR would require too many shared memory segments. The default SHMMNI on Linux is 4,096, and we were going to be using somewhere in the neighborhood of 1,500 with a max_size LRU of 500. He suggested using a single shared memory segment for all the circuit data.

I took a long stab at it, but ultimately concluded that it was too hard to do garbage collection. With semaphores, it's easy to do garbage collection with SEM_UNDO (in SysV IPC). With shared memory, we can set IPC_RMID to have the segment be destroyed after the last process detaches from it. But with a single segment for all the circuits, we'd have to manage garbage collection within the segment ourselves. I don't think it's impossible, but it adds a large amount of complexity to an already complex PR.
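
For reference, a minimal sketch (illustrative, not the extension's code) of SEM_UNDO-based cleanup: the kernel reverses a process's semaphore operations if it exits without releasing, which is what makes garbage collection easy on the semaphore side.

```c
/* Minimal sketch (not Semian's actual code) of SEM_UNDO cleanup:
 * if a process exits mid-acquire, the kernel undoes its operations. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void) {
    int semid = semget(ftok("/tmp", 'U'), 1, IPC_CREAT | 0660);
    if (semid == -1) { perror("semget"); return 1; }

    /* Initialize with the available count (1 for this sketch). */
    union semun arg = { .val = 1 };
    semctl(semid, 0, SETVAL, arg);

    /* Acquire with SEM_UNDO: if this process dies before releasing,
     * the kernel reverses the decrement automatically, so a crashed
     * worker can't leak what it acquired. */
    struct sembuf acquire = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
    if (semop(semid, &acquire, 1) == -1) { perror("semop"); return 1; }

    /* ... critical section ... */

    struct sembuf release = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };
    semop(semid, &release, 1);
    return 0;
}
```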

To alleviate concerns about the number of shared memory segments, I converted `Simple::Integer` to use a semaphore, since there are 32k semaphore sets available. That leaves the number of shared memory segments required at 500, since only `SlidingWindow` uses them.

As far as shipping this is concerned, I have https://github.com/Shopify/shopify/pull/206065 to bump core to use the new branch, with the previous behaviour. Then https://github.com/Shopify/semian/pull/243 adds support for an environment variable SEMIAN_CIRCUIT_BREAKER_FORCE_HOST which enables host-based circuits based on machine name (so we can enable it on a single node). Once we validate the behaviour on that node, we can move to a percentage rollout.

michaelkipper · Jul 03 '19, 15:07