semian icon indicating copy to clipboard operation
semian copied to clipboard

EDIT: This branch is the iteration target. Implementing a per-host circuit breaker state using shared memory and semaphores

Open kyewei opened this issue 10 years ago • 8 comments

@Sirupsen @csfrancis @byroot for review, suggest refactoring if needed!

Implementation details: Shared memory stores the space of int+int+double[], corresponding to @successes, @errors.count, and the @errors array itself (stored as doubles). The array has a max size of @errors.count since the Ruby code ensures the array never grows above that size.

  • Most of semaphore initialization, permissions, etc were borrowed from semian.c.
  • A regular array (as opposed to a linked list) was chosen because of the small array length and simplicity.
  • Shared memory is not explicitly marked for deletion from the C side since (1) it's on the order of ten's of bytes, (2) it won't increase if the same key/id is used to create the memory. This is the same(?) rationale as why in semian.c, semian_resource_free() doesn't clear the semaphores, instead leaving them to be cleared on reboot.

PR v2 (10/06/2015, up to hash 1f0960b...)

I did quite a lot of refactoring, so here's a detailed overview.

  • The final structure looks like this: Semian::SharedMemoryObject has two subclasses, Semian::AtomicInteger and Semian::SlidingWindow. I took suggestions to abstract the memory management further so that Semian::CircuitBreaker would only be using instances of the specific subclasses. I also added header files so code can be shared.
  • Semian::AtomicEnum subclasses Semian::AtomicInteger to provide shared memory for CircuitBreaker's @state. It runs like a traditional C-style enum type thats backed by integers. It's used in @state's :closed, :open, :half_open
  • SharedMemoryObject provides and manages the semaphore, shared memory, and the subclasses implement the use cases for the memory. Constructor calls look like this: SharedMemoryObject.new("name", [:int, :int, :long, :long], 0660) and specify the memory structure. The associated C structure has a reference to function of type void (*object_init_fn)(size_t, void *, void *, size_t, int) so a custom data initializer can be used for each type
  • I thought about using extend Forwardable for delegating and decorating but that would introduce some complications with binding C functions to objects (things like AtomicInteger#value would need an extra argument to access SharedMemoryObject functions)
  • I exposed a SharedMemoryObject#execute_atomically to allow custom actions such as those found in CircuitBreaker#push_time without providing #lock and #unlock explicitly
  • Dynamic resizing and initialization for SlidingWindow now works properly. The following scenarios resize:
    • Worker 1 uses memory of size=n; worker 2 is created, requesting memory of size!=n; both workers will use new size, with data copied over and reinitialized using object_init_fn(). This occurs when you restart workers with new configurations
    • #resize_to is called with new size on a worker, every worker will resize to new size
  • When initializing, memory is 0-initialized only if the worker is the only worker attached to it. Otherwise, memory resizing/copying is done. Use case: if one worker crashes, it can be restarted without losing data. However if all workers are restarted, it makes sense to reset the configuration to prevent some edge case from occurring.
  • Added tests for the shared memory. Looking for suggestions for more tests / or if there are better ways to do things than what I've done.
  • Fixed bugs/error locations

I'm looking for suggestions for how to structure the fallback ruby implementation so that both the fallback and the shared memory versions of AtomicInteger, SlidingWindow, etc can coexist. Right now C code hooks and replaces them if semaphores are enabled, i.e. AtomicInteger is either semaphore-supported or it isn't, and theres no way to toggle between them.

Also, if I missed anything, comment below.

kyewei avatar Sep 28 '15 21:09 kyewei

This needs some tests around (at least):

  • Two workers in two different processes sharing the CB (and the different scenarios that follow, e.g. circuit break trips, etc)
  • Forcefully killing a worker that might have the semaphore releases it (fuzzy)
  • Resizing the queue (e.g. two workers starting with different window sizes)
  • Starting a worker when the circuit breaker has failures in the sliding window should not reset it

Other than that I'm assuming you're borrowing from the existing test suite?

sirupsen avatar Sep 29 '15 13:09 sirupsen

Shared memory is not explicitly marked for deletion from the C side

Agree. We also don't need to use IPC_RMID for the same reason.

sirupsen avatar Sep 29 '15 13:09 sirupsen

In a follow-up we'll have to figure out how to make this part of Instrumentable so we can monitor the circuit breaker stats across all hosts.

sirupsen avatar Sep 29 '15 13:09 sirupsen

I'd add another comment but more general: Please avoid abbreviations, like cb, just use circuit_breaker instead.

byroot avatar Sep 29 '15 15:09 byroot

\cc @charliesome @jnunemaker you'll be interested in this for your adoption.

sirupsen avatar Sep 29 '15 15:09 sirupsen

So to sum up my initial code review, questions/changes I'd like to see are:

  • Merge as much code as possible with semian.c and compile them into the same shared object.
  • Simplify scope of C code by providing only the few primitives necessary for a sliding window with n counters (e.g. successes, size).
  • Fall back to a sliding window implementation that's completely in Ruby and works in just the memory of that Ruby VM.
  • Enable resizing the size of the sliding window dynamically.
  • A fresh worker (deploy, worker killed, ..) coming up in the same SysV namespace should not reset the circuit breaker.
  • Use integers and not floats to store the timestamps.
  • Monitoring the circuit breaker parameters (can be addressed in follow-up).
  • Improve error handling and return values from functions.
  • Support a fallback for semtimedop on platforms that don't support it (can be addressed in follow-up).
  • Needs a test suite that's specific to the worker's now sharing state (see my comment)

sirupsen avatar Sep 29 '15 15:09 sirupsen

I updated the PR with new changes. See above.

kyewei avatar Oct 06 '15 18:10 kyewei

@kyewei is going to split this into some smaller PRs since this is too hard to review.

sirupsen avatar Oct 22 '15 18:10 sirupsen