timers
Huge performance improvements, using the d_heap gem.
N.B. this is more a request for comments (with code) than a draft PR.
Over the holidays, I looked into making a PR to use a min-heap priority queue. But then I got bogged down in benchmarks and C extensions and SIMD vector intrinsics for over a month, and I totally forgot about my timers branches! Just today I thought I'd look into cleaning them up for a PR, and I saw @WJWH beat me to it with #74! 😆👍🏻 #74 does some of what I was trying to do, without using a C extension.

I'll need to rebase my changes on master to see how they compare against the new implementation and with the new performance specs. Apologies for any gratuitous formatting changes that accidentally sneaked into my branches; I intend to remove those.
## Description
Before implementing my min-heap, I wanted to get a fair comparison with the existing algorithm by refactoring it to use `Array#bsearch_index` (in my `nevans:perf-bsearch` branch). This is a huge improvement. Although it's still technically O(n) for push (`Array#insert` may need to `memcpy` a significant portion of the array), `memcpy` is so incredibly fast on modern hardware that the comparisons wind up dominating for N < 10,000. Anything that avoids jumping from C code into Ruby code during the comparisons sped it up. E.g. using `alias to_f time` instead of `def to_f; @time; end` is a significant part of this improvement. Combined with O(1) pop, this implementation is fantastic for smaller queues.
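That sorted-array strategy can be sketched in a few lines of plain Ruby (illustrative names, not the timers gem's actual code); keeping the array sorted descending makes popping the minimum a plain `Array#pop`:

```ruby
# Minimal sketch of a sorted-array priority queue built on Array#bsearch_index.
# Illustrative only -- not the timers gem's actual implementation.
class SortedQueue
  def initialize
    @array = [] # kept sorted descending, so the minimum is at the end
  end

  # O(log n) comparisons to find the slot; Array#insert then does the memcpy.
  def push(item)
    index = @array.bsearch_index { |other| other.to_f <= item.to_f }
    @array.insert(index || @array.size, item)
    item
  end

  # O(1): pop the minimum off the end of the descending array.
  def pop
    @array.pop
  end
end
```

The block uses `bsearch_index`'s find-minimum mode: in a descending array, "other <= item" flips from false to true exactly at the insertion point.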
Then I attempted a pure Ruby heap. @WJWH did it better in #74, so I won't say much about my version. :) While I did achieve an improvement over the original, I was deeply disappointed in its performance (for N < 50k) compared to the much simpler `Array#bsearch_index` algorithm. The pure Ruby heap I use in my `d_heap` benchmarks is faster than the version in that branch.
So then I looked into C extensions. I found a couple of other priority queue Ruby gems (one of them used the C++ STL `std::priority_queue`), but I wound up getting a bit obsessive about this and wrote my own: d_heap, which implements a d-ary heap. If you look at the benchmarks in the README, I think you'll get a notion of the relative scale between a pure Ruby min-heap, `Array#bsearch`, and `d_heap`. `d_heap` runs considerably faster for every queue size. It's even faster than the priority_queue_cxx gem's C++ `std::priority_queue` implementation (at least on my system).
Both Go and libev use a 4-heap, but I was surprised to find that (at least with my benchmarks on modern hardware and with modern compilers) d=4 was not the ideal value of d. I chalk this up to CPU branch prediction and gcc `-O3` loop unrolling (etc.). On my system (with AVX2 enabled), I discovered that d=12 was generally fastest. But I need to run more benchmarks on other systems to choose an appropriate default.
d_heap is still missing a few bits before it's production ready. At the very least:

- [ ] shrink the C array when items are popped!!!
- [ ] a simple JRuby version using `java.util.PriorityQueue`
- [ ] a fallback pure Ruby version for systems which can't compile the C extension
- [ ] test more hardware for a generically suitable default `d`
- [ ] test 32-bit hardware for compatibility
I went a bit overboard learning about SIMD vectors and playing with SSE/AVX2/AVX512 and C macros in one of my unreleased branches. I did push it to even faster performance than gcc's `-O3` with `-funroll-loops`, etc., but I completely forgot about cleaning it all up for a production release! 😆
## Types of Changes
- Performance improvement.
## Testing
- [x] I added tests for my changes.
  - [x] extensive tests in the d_heap gem
  - [x] increased numbers on the existing perf spec to clarify differences
  - [ ] intended to add a couple of specs similar to 92b2a81d to simulate realistic timer scenarios
- [x] I tested my changes locally.
- [ ] I tested my changes in staging.
- [ ] I tested my changes in production.
## Other variations to consider
Go is very smart about lazily updating its timers heap, keeping a counter of all deleted or updated timers and flagging the individual timers. If you delete/adjust timers lazily, then most timers can be rescheduled before they incur any cancellation + re-insertion costs.
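A rough sketch of that lazy-cancellation pattern (illustrative names, not Go's actual implementation; a bsearch-sorted array stands in for the heap):

```ruby
# Lazy timer cancellation: canceling only flags the timer; flagged timers are
# discarded when they surface at pop. A timer that is rescheduled before it
# fires never pays the cancel + re-insert cost.
Timer = Struct.new(:time, :canceled)

class LazyTimerQueue
  def initialize
    @queue = []     # sorted ascending by time (stands in for a real min-heap)
    @canceled = 0   # a real implementation would compact when this grows large
  end

  def push(timer)
    index = @queue.bsearch_index { |t| t.time >= timer.time } || @queue.size
    @queue.insert(index, timer)
    timer
  end

  # O(1): just flag the timer; no heap surgery needed.
  def cancel(timer)
    timer.canceled = true
    @canceled += 1
  end

  # Discard flagged timers only when they reach the front.
  def pop
    while (timer = @queue.shift)
      if timer.canceled
        @canceled -= 1
      else
        return timer
      end
    end
    nil
  end
end
```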
One variation I intended to benchmark but never got around to: use the `bsearch` implementation for smaller queues, but dynamically convert to a min-heap when the queue grows large. A sorted array already satisfies the heap property, so the only tricky part is tuning when to `sort!` and convert back to `bsearch` if/when the queue shrinks and stays small.
Another variation relies on three things: 1) all timers with the same timeout can be handled with a FIFO, 2) CRuby's `Array#shift` is special-cased to allow O(1) queue behavior, and 3) the assumption that most timers will have integer timeouts of less than 5 minutes. So we could create an Array of timeouts => timers, e.g. `Array.new(5 * 60) { [] }`, with two additional priority queues: the first for any other timers (e.g. deadline-based, float duration, or longer than 5 minutes), and the second to select the minimum. The merged priority queue would hold no more than 301 items, so `bsearch` + `insert` would be super fast (`d_heap` would still be 4x faster). Any integer timeout < 5 minutes could be effectively O(1).
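A toy version of that bucket scheme (all names hypothetical; a linear scan over the bucket heads stands in for the small merged priority queue):

```ruby
# Bucketed timeouts: timers sharing an integer timeout fire in FIFO order, so
# each bucket is a plain Array used as a queue (CRuby's Array#shift is O(1)).
# Non-integer or long timeouts fall back to a sorted overflow queue.
class BucketedTimers
  MAX_TIMEOUT = 5 * 60

  def initialize
    @buckets  = Array.new(MAX_TIMEOUT) { [] } # index = timeout in seconds
    @overflow = [] # sorted ascending; stands in for the fallback priority queue
  end

  def schedule(timeout, now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
    deadline = now + timeout
    if timeout.is_a?(Integer) && timeout < MAX_TIMEOUT
      @buckets[timeout] << deadline # O(1); the bucket stays sorted by deadline
    else
      index = @overflow.bsearch_index { |d| d >= deadline } || @overflow.size
      @overflow.insert(index, deadline)
    end
    deadline
  end

  # Scan at most 301 candidates (300 bucket heads + the overflow head).
  def pop
    heads = @buckets.reject(&:empty?).map(&:first)
    heads << @overflow.first unless @overflow.empty?
    min = heads.min or return nil
    bucket = @buckets.find { |b| b.first == min }
    bucket ? bucket.shift : @overflow.shift
  end
end
```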
Another variation (I saw this in a Python library, I forget which one) relies on the assumption that most timers are timeouts that we expect to be canceled (the operation will succeed before timing out). To that end, maintain both "near" and "far" collections. The "near" collection can be a min-heap, but the "far" collection is simply a `Set` or a `Map` with an ivar for the "far deadline". Whenever the far deadline gets "near", the map is scanned (O(n)) for anything below the newly extended far deadline. The idea here is to get O(1) cancellation for most timers, because canceling timers is more common than firing them.
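A rough sketch of that near/far split, with illustrative names and a made-up 60-second horizon (a sorted array stands in for the near min-heap):

```ruby
require "set"

# Near/far timer split: "far" timers live in an unordered Set, so the common
# case (canceling a timeout that never fires) is O(1). When the horizon
# passes, the far set is scanned and newly "near" timers are promoted.
class NearFarTimers
  HORIZON = 60.0 # how far ahead counts as "near" (arbitrary for this sketch)

  def initialize
    @near = []                      # sorted ascending by deadline
    @far  = Set.new                 # unordered: O(1) insert and cancel
    @far_deadline = -Float::INFINITY
  end

  def push(deadline, now:)
    promote_far(now)
    if deadline <= now + HORIZON
      index = @near.bsearch_index { |d| d >= deadline } || @near.size
      @near.insert(index, deadline)
    else
      @far << deadline
    end
  end

  # O(1) for far timers -- the common case under this scheme's assumption.
  def cancel(deadline)
    @far.delete?(deadline) || @near.delete(deadline)
  end

  def pop(now:)
    promote_far(now)
    @near.shift
  end

  private

  # O(n) scan of the far set, amortized over the whole horizon window.
  def promote_far(now)
    return if now < @far_deadline
    @far_deadline = now + HORIZON
    @far.select { |d| d <= @far_deadline }.each do |d|
      @far.delete(d)
      index = @near.bsearch_index { |x| x >= d } || @near.size
      @near.insert(index, d)
    end
  end
end
```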
But IMO the two previous variations both point towards various sorts of timing wheels. The Linux process scheduler uses a version with reduced precision based on how far away the deadline is when it's scheduled, and Kafka uses hierarchical timing wheels. With timing wheels, we could effectively achieve O(1) for push, pop, and cancel. But timing wheels are conceptually much more complex than a simple min-heap. I figured, if Go's timers can get by with a min-heap, it's probably fine to just start there.
I think you've come to similar conclusions to me: the overhead of `memcpy` only becomes an issue once the queue gets excessively large.
My response to that was the pending list of timers, many of which are never actually invoked in practice.
Regarding the number of timers, it would be hard to imagine there ever being more than 10,000 timers at any one time, given other limitations of the system.
That being said, I agree that using a proper heap makes a lot of sense, so I think we should continue to explore this direction.
One option to consider is whether it would make more sense to introduce a priority queue implementation to Ruby itself. There are many places where I think this could be useful, and I'd be happy to help with getting it merged. The other option is a gem... which is also okay.
@WJWH can I get your feedback on this?
> Regarding the number of timers, it would be hard to imagine there ever being more than 10,000 timers at any one time, given other limitations of the system.
Yeah, my current largest production deployment is still using... EventMachine. EM's timers use the C++ STL `std::multimap` (i.e. an rb-tree). The timers are not even remotely my bottleneck!
> it would make more sense to introduce a priority queue implementation to Ruby itself
IMO, yes! 😄
If I remember correctly, Matz prefers (and I agree) that stdlib classes should represent abstract data types more than specific data structure implementations. But more than once, I've been annoyed that stdlib's `SortedSet` is completely broken without the `rbtree` gem (which was unmaintained until very recently). A `SortedHash` would be nice to have in the standard lib too. But with 3.0's new `Fiber.scheduler`, the absence of a `PriorityQueue` class is an especially noticeable omission. It'd be useful beyond just scheduling, too.
I actually started out by attempting to mirror the style of `Array#sort`, `Array#sort!`, `Array#bsearch`, and `Array#bsearch_index` with a similar `Array#heapify`, `Array#heapify!`, `Array#heap_push`, and `Array#heap_pop`. This is (sort of) what Python's `heapq` does. But when the options to those methods and the number of methods started increasing, I switched to a dedicated priority queue class.
And then I discovered how slow comparisons with `<=>` are. In the worst case, a heap needs more comparisons per combined push + pop ((1 + d) log n / log d) than the `bsearch` + `insert` approach (log n / log 2), so it's critical that those comparisons run fast. I eventually figured out how to partially implement the same comparison optimizations used by several `Enumerable` methods in stdlib. But if you can stay in C with simple inline numeric comparisons, then either gcc's optimizations or CPU branch prediction take over and make everything so much faster (even without my crazy macro-unrolled SIMD loops). So I ultimately decided to only allow Float scores and revisit more generic comparisons later.
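To make those constant factors concrete, here is the arithmetic for a few queue sizes (a worst-case estimate straight from the formulas above; real counts depend on input order):

```ruby
# Worst-case comparison counts for one combined push + pop:
# a d-ary heap vs. binary search into a sorted array.
def heap_comparisons(n, d)
  (1 + d) * Math.log(n) / Math.log(d) # sift-down compares d children per layer
end

def bsearch_comparisons(n)
  Math.log(n) / Math.log(2) # binary search for the insertion slot
end

[100, 10_000, 1_000_000].each do |n|
  printf("n = %9d  4-heap: %5.1f  bsearch: %5.1f\n",
         n, heap_comparisons(n, 4), bsearch_comparisons(n))
end
```

At n = 10,000 the 4-heap's worst case is roughly 33 comparisons vs. roughly 13 for bsearch, which is why per-comparison cost dominates the heap's performance.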
The benefit of making timers super efficient is that you end up using them everywhere without concern. That's part of the reason why I made the pending list: it makes unused timers almost free.
100% agree that it'd be nice to just use a priority queue anywhere without any concern that it's the bottleneck. Priority queues can be "hotspot" code for schedulers and other algorithms.
FWIW, I did grab the timers `PriorityQueue` and stack it up against the pure Ruby implementations I'd been using for my d_heap benchmarks. These all use random inputs; they're missing a more "realistic" benchmark that simulates a monotonically increasing timers heap, but I don't expect that to behave much differently from the "pushpop" benchmark with random values.
My latest (pure Ruby) priority queue benchmark: https://gist.github.com/nevans/6473e9fe1bf95930c5382d86db863b5b. My `Rb2Heap` and `Rb4Heap` both outperform the `timers` priority queue by ~1.5x. The most important trick: only one assignment per layer (don't "swap"). Also, switching to a 4-heap with an unrolled loop for min-index is generally faster (sometimes just barely). `BSearch` crosses over somewhere between 10k and 100k.
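For reference, the "one assignment per layer" trick looks roughly like this (a simplified binary-heap sketch, not the actual gist code): instead of swapping at each level (three assignments), hold the new value aside, shift parents or children into the hole, and write the value once at its final slot.

```ruby
# Simplified binary min-heap using single-assignment sift-up and sift-down.
class Rb2HeapSketch
  def initialize
    @heap = []
  end

  def push(value)
    index = @heap.size
    while index > 0
      parent_index = (index - 1) / 2
      parent = @heap[parent_index]
      break if parent <= value
      @heap[index] = parent # shift the parent down: one assignment per layer
      index = parent_index
    end
    @heap[index] = value # write the new value exactly once
  end

  def pop
    return @heap.pop if @heap.size <= 1
    min = @heap[0]
    value = @heap.pop # last leaf fills the hole left at the root
    index = 0
    loop do
      child = index * 2 + 1
      break if child >= @heap.size
      right = child + 1
      child = right if right < @heap.size && @heap[right] < @heap[child]
      break if value <= @heap[child]
      @heap[index] = @heap[child] # shift the smaller child up
      index = child
    end
    @heap[index] = value
    min
  end
end
```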
Some excerpts:

```
push N (N = 3,162,278)
Rb4Heap: 5205985.6 i/s
Rb2Heap: 3872629.7 i/s - 1.34x slower
Timers gem: 3415913.9 i/s - 1.52x slower
BSearch: 0.0 i/s

push N (N = 100,000)
Rb4Heap: 5528770.6 i/s
Rb2Heap: 4313109.2 i/s - 1.28x slower
Timers gem: 3415591.6 i/s - 1.62x slower
BSearch: 200902.2 i/s - 27.52x slower

push N (N = 31,623)
Rb4Heap: 5863469.5 i/s
Rb2Heap: 4416811.1 i/s - 1.33x slower
Timers gem: 3462349.2 i/s - 1.69x slower
BSearch: 563895.6 i/s - 10.40x slower

push N (N = 10)
Rb4Heap: 6312993.5 i/s
Rb2Heap: 5515562.1 i/s - 1.14x slower
BSearch: 4258213.7 i/s - 1.48x slower
Timers gem: 4226971.0 i/s - 1.49x slower

pushpop w/ size = 3,162,278
Rb4Heap: 369297.9 i/s
Rb2Heap: 337428.7 i/s - 1.09x slower
Timers gem: 255441.3 i/s - 1.45x slower
BSearch: 0.0 i/s

pushpop w/ size = 100,000
Rb4Heap: 459019.5 i/s
Rb2Heap: 434168.4 i/s - 1.06x slower
Timers gem: 290205.6 i/s - 1.58x slower
BSearch: 228682.2 i/s - 2.01x slower

pushpop w/ size = 31,623
BSearch: 956823.2 i/s
Rb4Heap: 507833.1 i/s - 1.88x slower
Rb2Heap: 456219.1 i/s - 2.10x slower
Timers gem: 310593.7 i/s - 3.08x slower

pushpop w/ size = 10
BSearch: 3613651.1 i/s
Rb4Heap: 1616519.3 i/s - 2.24x slower
Rb2Heap: 1604217.9 i/s - 2.25x slower
Timers gem: 1111453.2 i/s - 3.25x slower

push N, pop N (N = 3,162,278)
Rb4Heap: 812798.4 i/s
Rb2Heap: 745647.4 i/s - 1.09x slower
Timers gem: 553322.0 i/s - 1.47x slower
BSearch: 0.0 i/s

push N, pop N (N = 100,000)
Rb4Heap: 1108310.9 i/s
Rb2Heap: 1095127.5 i/s - 1.01x slower
Timers gem: 834546.7 i/s - 1.33x slower
BSearch: 401824.9 i/s - 2.76x slower

push N, pop N (N = 31,623)
Rb4Heap: 1261109.5 i/s
Rb2Heap: 1226813.8 i/s - 1.03x slower
BSearch: 1070724.9 i/s - 1.18x slower
Timers gem: 904562.3 i/s - 1.39x slower

push N, pop N (N = 10)
BSearch: 6166773.3 i/s
Rb4Heap: 3922461.5 i/s - 1.57x slower
Rb2Heap: 3920965.2 i/s - 1.57x slower
Timers gem: 3305228.9 i/s - 1.87x slower
```
(N values were chosen to be powers of sqrt(10).)
Great work on the benchmarks.
If we can make it an optional dependency, I'm fine with having it; it looks great. We still need to be careful about JRuby/TruffleRuby support, which is where a pure Ruby implementation is good.
I'm so sorry, I just completely fell off the map with all of this and a bunch of other little open source stuff I'd wanted to accomplish this year. I'm going to try to clean up some of my code prior to RubyConf in November. Would a heap written as a C extension (or something like it) still be useful in this gem? Or could we even push something like it into stdlib in time for 3.1?
Ah, and even if we don't go with a C extension, I just noticed (remembered) that I had a carefully optimized pure Ruby implementation, linked in the gist above. I can create a PR with just that, too. I haven't run benchmarks in JRuby or TruffleRuby yet, but I expect this is the sort of code they naturally handle very well anyway.
> I had a carefully optimized pure ruby implementation, linked in the gist above. I can create a PR with just that, too.
I think that's a good idea.