semian
semian copied to clipboard
EDIT: This branch is the iteration target. Implementing a per-host circuit breaker state using shared memory and semaphores
@Sirupsen @csfrancis @byroot for review, suggest refactoring if needed!
Implementation details:
Shared memory stores the space of int+int+double[], corresponding to @successes, @errors.count, and the @errors array itself (stored as doubles). The array has a max size of @errors.count since the Ruby code ensures the array never grows above that size.
- Most of semaphore initialization, permissions, etc were borrowed from
semian.c. - A regular array (as opposed to a linked list) was chosen because of the small array length and simplicity.
- Shared memory is not explicitly marked for deletion from the C side since (1) it's on the order of ten's of bytes, (2) it won't increase if the same key/id is used to create the memory. This is the same(?) rationale as why in
semian.c,semian_resource_free()doesn't clear the semaphores, instead leaving them to be cleared on reboot.
PR v2 (10/06/2015, up to hash 1f0960b...)
I did quite a lot of refactoring, so here's a detailed overview.
- The final structure looks like this:
Semian::SharedMemoryObjecthas two subclasses,Semian::AtomicIntegerandSemian::SlidingWindow. I took suggestions to abstract the memory management further so thatSemian::CircuitBreakerwould only be using instances of the specific subclasses. I also added header files so code can be shared. Semian::AtomicEnumsubclassesSemian::AtomicIntegerto provide shared memory forCircuitBreaker's@state. It runs like a traditional C-style enum type thats backed by integers. It's used in@state's:closed,:open,:half_openSharedMemoryObjectprovides and manages the semaphore, shared memory, and the subclasses implement the use cases for the memory. Constructor calls look like this:SharedMemoryObject.new("name", [:int, :int, :long, :long], 0660)and specify the memory structure. The associated C structure has a reference to function of typevoid (*object_init_fn)(size_t, void *, void *, size_t, int)so a custom data initializer can be used for each type- I thought about using
extend Forwardablefor delegating and decorating but that would introduce some complications with binding C functions to objects (things likeAtomicInteger#valuewould need an extra argument to accessSharedMemoryObjectfunctions) - I exposed a
SharedMemoryObject#execute_atomicallyto allow custom actions such as those found inCircuitBreaker#push_timewithout providing#lockand#unlockexplicitly - Dynamic resizing and initialization for
SlidingWindownow works properly. The following scenarios resize:- Worker 1 uses memory of
size=n; worker 2 is created, requesting memory ofsize!=n; both workers will use new size, with data copied over and reinitialized usingobject_init_fn(). This occurs when you restart workers with new configurations #resize_tois called with new size on a worker, every worker will resize to new size
- Worker 1 uses memory of
- When initializing, memory is 0-initialized only if the worker is the only worker attached to it. Otherwise, memory resizing/copying is done. Use case: if one worker crashes, it can be restarted without losing data. However if all workers are restarted, it makes sense to reset the configuration to prevent some edge case from occurring.
- Added tests for the shared memory. Looking for suggestions for more tests / or if there are better ways to do things than what I've done.
- Fixed bugs/error locations
I'm looking for suggestions for how to structure the fallback ruby implementation so that both the fallback and the shared memory versions of AtomicInteger, SlidingWindow, etc can coexist. Right now C code hooks and replaces them if semaphores are enabled, i.e. AtomicInteger is either semaphore-supported or it isn't, and theres no way to toggle between them.
Also, if I missed anything, comment below.
This needs some tests around (at least):
- Two workers in two different processes sharing the CB (and the different scenarios that follow, e.g. circuit break trips, etc)
- Forcefully killing a worker that might have the semaphore releases it (fuzzy)
- Resizing the queue (e.g. two workers starting with different window sizes)
- Starting a worker when the circuit breaker has failures in the sliding window should not reset it
Other than that I'm assuming you're borrowing from the existing test suite?
Shared memory is not explicitly marked for deletion from the C side
Agree. We also don't need to use IPC_RMID for the same reason.
In a follow-up we'll have to figure out how to make this part of Instrumentable so we can monitor the circuit breaker stats across all hosts.
I'd add another comment but more general: Please avoid abbreviations, like cb, just use circuit_breaker instead.
\cc @charliesome @jnunemaker you'll be interested in this for your adoption.
So to sum up my initial code review, questions/changes I'd like to see are:
- Merge as much code as possible with
semian.cand compile them into the same shared object. - Simplify scope of C code by providing only the few primitives necessary for a sliding window with
ncounters (e.g. successes, size). - Fall back to a sliding window implementation that's completely in Ruby and works in just the memory of that Ruby VM.
- Enable resizing the size of the sliding window dynamically.
- A fresh worker (deploy, worker killed, ..) coming up in the same SysV namespace should not reset the circuit breaker.
- Use integers and not floats to store the timestamps.
- Monitoring the circuit breaker parameters (can be addressed in follow-up).
- Improve error handling and return values from functions.
- Support a fallback for
semtimedopon platforms that don't support it (can be addressed in follow-up). - Needs a test suite that's specific to the worker's now sharing state (see my comment)
I updated the PR with new changes. See above.
@kyewei is going to split this into some smaller PRs since this is too hard to review.