timers
Huge performance improvements, using the d_heap gem.
N.B. this is more a request for comments (with code) than a draft PR.
Over the holidays, I looked into making a PR to use a min-heap priority queue. But then I got bogged down in benchmarks and C extensions and SIMD vector intrinsics for over a month, and I totally forgot about my timers branches! Just today I thought I'd look into cleaning them up for a PR, and I saw @WJWH beat me to it with #74! 😆👍🏻 #74 does some of what I was trying to do, without using a C extension.

I'll need to rebase my changes on master to see how they compare against the new implementation and with the new performance specs. Apologies for any gratuitous formatting changes that accidentally sneaked into my branches; I intend to remove those.
## Description
Before implementing my min-heap, I wanted to get a fair comparison with the existing algorithm by refactoring it to use `Array#bsearch_index` (in my `nevans:perf-bsearch` branch). This is a huge improvement. Although it's still technically O(n) for push (`Array#insert` may need to `memcpy` a significant portion of the array), `memcpy` is so incredibly fast on modern hardware that the comparisons wind up dominating for N < 10,000. Anything that avoids jumping from C code into Ruby code during the comparisons sped it up. E.g. using `alias to_f time` instead of `def to_f; @time; end` is a significant part of this improvement. Combined with O(1) pop, this implementation is fantastic for smaller queues.
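That sorted-array strategy can be sketched in a few lines of plain Ruby (illustrative names, not the timers gem's actual code); keeping the array sorted descending makes popping the minimum a plain `Array#pop`:

```ruby
# Minimal sketch of a sorted-array priority queue built on Array#bsearch_index.
# Illustrative only -- not the timers gem's actual implementation.
class SortedQueue
  def initialize
    @array = [] # kept sorted descending, so the minimum is at the end
  end

  # O(log n) comparisons to find the slot; Array#insert then does the memcpy.
  def push(item)
    index = @array.bsearch_index { |other| other.to_f <= item.to_f }
    @array.insert(index || @array.size, item)
    item
  end

  # O(1): pop the minimum off the end of the descending array.
  def pop
    @array.pop
  end
end
```

The block uses `bsearch_index`'s find-minimum mode: in a descending array, "other <= item" flips from false to true exactly at the insertion point.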
Then I attempted a pure Ruby heap. @WJWH did it better in #74, so I won't say much about my version. :) While I did achieve an improvement over the original, I was deeply disappointed in its performance (for N < 50k) compared to the much simpler `Array#bsearch_index` algorithm. The pure Ruby heap I use in my `d_heap` benchmarks is faster than the version in that branch.
So then I looked into C extensions. I found a couple of other priority queue Ruby gems (one of them used the C++ STL `std::priority_queue`), but I wound up getting a bit obsessive about this and wrote my own: d_heap, which implements a d-ary heap. If you look at the benchmarks in the README, I think you'll get a notion of the relative scale between a pure Ruby min-heap, `Array#bsearch`, and `d_heap`. `d_heap` runs considerably faster for every queue size. It's even faster than the priority_queue_cxx gem's C++ `std::priority_queue` implementation (at least on my system).
Both Go and libev use a 4-heap, but I was surprised to find that (at least with my benchmarks on modern hardware and with modern compilers) d=4 was not the ideal value of d. I chalk this up to CPU branch prediction and gcc `-O3` loop unrolling (etc.). On my system (with AVX2 enabled), I discovered that d=12 was generally fastest. But I need to run more benchmarks on other systems to choose an appropriate default.
d_heap is still missing a few bits before it's production ready. At the very least:

- [ ] shrink the C array when items are popped!!!
- [ ] a simple JRuby version using `java.util.PriorityQueue`
- [ ] a fallback pure Ruby version for systems which can't compile the C extension
- [ ] test more hardware for a generically suitable default `d`
- [ ] test 32-bit hardware for compatibility
I went a bit overboard learning about SIMD vectors and playing with SSE/AVX2/AVX512 and C macros in one of my unreleased branches. I did push it to even faster performance than gcc's `-O3` with `-funroll-loops`, etc., but I completely forgot about cleaning it all up for a production release! 😆
## Types of Changes
- Performance improvement.
## Testing
- [x] I added tests for my changes.
  - [x] extensive tests in the d_heap gem
  - [x] increased numbers on the existing perf spec to clarify differences
  - [ ] intended to add a couple of specs similar to 92b2a81d to simulate realistic timer scenarios
- [x] I tested my changes locally.
- [ ] I tested my changes in staging.
- [ ] I tested my changes in production.
## Other variations to consider
Go is very smart about lazily updating its timers heap, keeping a counter of all deleted or updated timers and flagging the individual timers. If you delete/adjust timers lazily, then most timers can be rescheduled before they incur any cancellation + re-insertion costs.
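A rough sketch of that lazy-cancellation pattern (illustrative names, not Go's actual implementation; a bsearch-sorted array stands in for the heap):

```ruby
# Lazy timer cancellation: canceling only flags the timer; flagged timers are
# discarded when they surface at pop. A timer that is rescheduled before it
# fires never pays the cancel + re-insert cost.
Timer = Struct.new(:time, :canceled)

class LazyTimerQueue
  def initialize
    @queue = []     # sorted ascending by time (stands in for a real min-heap)
    @canceled = 0   # a real implementation would compact when this grows large
  end

  def push(timer)
    index = @queue.bsearch_index { |t| t.time >= timer.time } || @queue.size
    @queue.insert(index, timer)
    timer
  end

  # O(1): just flag the timer; no heap surgery needed.
  def cancel(timer)
    timer.canceled = true
    @canceled += 1
  end

  # Discard flagged timers only when they reach the front.
  def pop
    while (timer = @queue.shift)
      if timer.canceled
        @canceled -= 1
      else
        return timer
      end
    end
    nil
  end
end
```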
One variation I intended to benchmark but never got around to: use the `bsearch` implementation for smaller queues, but dynamically convert to a min-heap when the queue grows large. A sorted array already satisfies the heap property, so the only tricky part is tuning when to `sort!` and convert back to `bsearch` if/when the queue shrinks and stays small.
Another variation relies on three things: 1) all timers with the same timeout can be handled with a FIFO, 2) CRuby's `Array#shift` is special-cased to allow O(1) queue behavior, and 3) the assumption that most timers will have integer timeouts of less than 5 minutes. So we could create an Array of timeouts => timers, e.g. `Array.new(5 * 60) { [] }`, with two additional priority queues: the first for any other timers (e.g. deadline-based, float duration, or longer than 5 minutes), and the second to select the minimum. The merged priority queue would hold no more than 301 items, so `bsearch` + `insert` would be super fast (`d_heap` would still be 4x faster). Any integer timeout < 5 minutes could be effectively O(1).
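A toy version of that bucket scheme (all names hypothetical; a linear scan over the bucket heads stands in for the small merged priority queue):

```ruby
# Bucketed timeouts: timers sharing an integer timeout fire in FIFO order, so
# each bucket is a plain Array used as a queue (CRuby's Array#shift is O(1)).
# Non-integer or long timeouts fall back to a sorted overflow queue.
class BucketedTimers
  MAX_TIMEOUT = 5 * 60

  def initialize
    @buckets  = Array.new(MAX_TIMEOUT) { [] } # index = timeout in seconds
    @overflow = [] # sorted ascending; stands in for the fallback priority queue
  end

  def schedule(timeout, now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
    deadline = now + timeout
    if timeout.is_a?(Integer) && timeout < MAX_TIMEOUT
      @buckets[timeout] << deadline # O(1); the bucket stays sorted by deadline
    else
      index = @overflow.bsearch_index { |d| d >= deadline } || @overflow.size
      @overflow.insert(index, deadline)
    end
    deadline
  end

  # Scan at most 301 candidates (300 bucket heads + the overflow head).
  def pop
    heads = @buckets.reject(&:empty?).map(&:first)
    heads << @overflow.first unless @overflow.empty?
    min = heads.min or return nil
    bucket = @buckets.find { |b| b.first == min }
    bucket ? bucket.shift : @overflow.shift
  end
end
```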
Another variation (I saw this in a Python library, I forget which one) relies on the assumption that most timers are timeouts that we expect to be canceled (the operation will succeed before timing out). To that end, maintain both "near" and "far" collections. The "near" collection can be a min-heap, but the "far" collection is simply a `Set` or a `Map` with an ivar for the "far deadline". Whenever the far deadline gets "near", the map is scanned (O(n)) for anything below the newly extended far deadline. The idea here is to get O(1) cancellation for most timers, because canceling timers is more common than firing them.
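A rough sketch of that near/far split, with illustrative names and a made-up 60-second horizon (a sorted array stands in for the near min-heap):

```ruby
require "set"

# Near/far timer split: "far" timers live in an unordered Set, so the common
# case (canceling a timeout that never fires) is O(1). When the horizon
# passes, the far set is scanned and newly "near" timers are promoted.
class NearFarTimers
  HORIZON = 60.0 # how far ahead counts as "near" (arbitrary for this sketch)

  def initialize
    @near = []                      # sorted ascending by deadline
    @far  = Set.new                 # unordered: O(1) insert and cancel
    @far_deadline = -Float::INFINITY
  end

  def push(deadline, now:)
    promote_far(now)
    if deadline <= now + HORIZON
      index = @near.bsearch_index { |d| d >= deadline } || @near.size
      @near.insert(index, deadline)
    else
      @far << deadline
    end
  end

  # O(1) for far timers -- the common case under this scheme's assumption.
  def cancel(deadline)
    @far.delete?(deadline) || @near.delete(deadline)
  end

  def pop(now:)
    promote_far(now)
    @near.shift
  end

  private

  # O(n) scan of the far set, amortized over the whole horizon window.
  def promote_far(now)
    return if now < @far_deadline
    @far_deadline = now + HORIZON
    @far.select { |d| d <= @far_deadline }.each do |d|
      @far.delete(d)
      index = @near.bsearch_index { |x| x >= d } || @near.size
      @near.insert(index, d)
    end
  end
end
```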
But IMO the two previous variations both point towards various sorts of timing wheels. The Linux process scheduler uses a version with reduced precision based on how far away the deadline is when it's scheduled, and Kafka uses hierarchical timing wheels. With timing wheels, we could effectively achieve O(1) for push, pop, and cancel. But timing wheels are conceptually much more complex than a simple min-heap. I figured, if Go's timers can get by with a min-heap, it's probably fine to just start there.
I think you've come to similar conclusions to me: the overhead of `memcpy` only becomes an issue once the queue gets excessively large.
My response to that was the pending list of timers, many of which are never actually invoked in practice.
Regarding the number of timers, it would be hard to imagine there ever being more than 10,000 timers at any one time, given other limitations of the system.
That being said, I agree that using a proper heap makes a lot of sense, so I think we should continue to explore this direction.
One option to consider is whether it would make more sense to introduce a priority queue implementation to Ruby itself. There are many places where I think this could be useful, and I'd be happy to help with getting it merged. The other option is a gem... which is also okay.
@WJWH can I get your feedback on this?
> Regarding the number of timers, it would be hard to imagine there ever being more than 10,000 timers at any one time, given other limitations of the system.
Yeah, my current largest production deployment is still using... EventMachine. EM's timers use the C++ STL `std::multimap` (i.e. an rb-tree). The timers are not even remotely my bottleneck!
> it would make more sense to introduce a priority queue implementation to Ruby itself
IMO, yes! 😄
If I remember correctly, Matz prefers (and I agree) that stdlib classes should represent abstract data types more than specific data structure implementations. But more than once, I've been annoyed that stdlib's `SortedSet` is completely broken without the `rbtree` gem (which was unmaintained until very recently). A `SortedHash` would be nice to have in the standard lib too. But with 3.0's new `Fiber.scheduler`, the absence of a `PriorityQueue` class is an especially noticeable omission. It'd be useful beyond just scheduling, too.
I actually started out by attempting to mirror the style of `Array#sort`, `Array#sort!`, `Array#bsearch`, and `Array#bsearch_index` with a similar `Array#heapify`, `Array#heapify!`, `Array#heap_push`, and `Array#heap_pop`. This is (sort of) what Python's `heapq` does. But when the options to those methods and the number of methods started increasing, I switched to a dedicated priority queue class.
And then I discovered how slow comparisons with `<=>` are. In the worst case, a heap needs more comparisons per combined push + pop ((1 + d) log n / log d) than the `bsearch` + `insert` approach (log n / log 2), so it's critical that those comparisons run fast. I eventually figured out how to partially implement the same comparison optimizations used by several `Enumerable` methods in stdlib. But if you can stay in C with simple inline numeric comparisons, then either gcc's optimizations or CPU branch prediction take over and make everything so much faster (even without my crazy macro-unrolled SIMD loops). So I ultimately decided to only allow Float scores and revisit more generic comparisons later.
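To make those constant factors concrete, here is the arithmetic for a few queue sizes (a worst-case estimate straight from the formulas above; real counts depend on input order):

```ruby
# Worst-case comparison counts for one combined push + pop:
# a d-ary heap vs. binary search into a sorted array.
def heap_comparisons(n, d)
  (1 + d) * Math.log(n) / Math.log(d) # sift-down compares d children per layer
end

def bsearch_comparisons(n)
  Math.log(n) / Math.log(2) # binary search for the insertion slot
end

[100, 10_000, 1_000_000].each do |n|
  printf("n = %9d  4-heap: %5.1f  bsearch: %5.1f\n",
         n, heap_comparisons(n, 4), bsearch_comparisons(n))
end
```

At n = 10,000 the 4-heap's worst case is roughly 33 comparisons vs. roughly 13 for bsearch, which is why per-comparison cost dominates the heap's performance.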
The benefit of making timers super efficient is that you end up using them everywhere without concern. That's part of the reason why I made the pending list: it makes unused timers almost free.
100% agree that it'd be nice to just use a priority queue anywhere without any concern that it's the bottleneck. Priority queues can be "hotspot" code for schedulers and other algorithms.
FWIW, I did grab the timers `PriorityQueue` and stack it up against the pure Ruby implementations I'd been using for my d_heap benchmarks. These all use random inputs; they're missing a more "realistic" benchmark that simulates a monotonically increasing timers heap, but I don't expect that to behave much differently from the "pushpop" benchmark with random values.
My latest (pure Ruby) priority queue benchmark: https://gist.github.com/nevans/6473e9fe1bf95930c5382d86db863b5b. My `Rb2Heap` and `Rb4Heap` both outperform the `timers` priority queue by ~1.5x. The most important trick: only one assignment per layer (don't "swap"). Also, switching to a 4-heap with an unrolled loop for min-index is generally faster (sometimes just barely). `BSearch` crosses over somewhere between 10k and 100k.
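For reference, the "one assignment per layer" trick looks roughly like this (a simplified binary-heap sketch, not the actual gist code): instead of swapping at each level (three assignments), hold the new value aside, shift parents or children into the hole, and write the value once at its final slot.

```ruby
# Simplified binary min-heap using single-assignment sift-up and sift-down.
class Rb2HeapSketch
  def initialize
    @heap = []
  end

  def push(value)
    index = @heap.size
    while index > 0
      parent_index = (index - 1) / 2
      parent = @heap[parent_index]
      break if parent <= value
      @heap[index] = parent # shift the parent down: one assignment per layer
      index = parent_index
    end
    @heap[index] = value # write the new value exactly once
  end

  def pop
    return @heap.pop if @heap.size <= 1
    min = @heap[0]
    value = @heap.pop # last leaf fills the hole left at the root
    index = 0
    loop do
      child = index * 2 + 1
      break if child >= @heap.size
      right = child + 1
      child = right if right < @heap.size && @heap[right] < @heap[child]
      break if value <= @heap[child]
      @heap[index] = @heap[child] # shift the smaller child up
      index = child
    end
    @heap[index] = value
    min
  end
end
```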
Some excerpts:

```
push N (N = 3,162,278)
Rb4Heap: 5205985.6 i/s
Rb2Heap: 3872629.7 i/s - 1.34x slower
Timers gem: 3415913.9 i/s - 1.52x slower
BSearch: 0.0 i/s

push N (N = 100,000)
Rb4Heap: 5528770.6 i/s
Rb2Heap: 4313109.2 i/s - 1.28x slower
Timers gem: 3415591.6 i/s - 1.62x slower
BSearch: 200902.2 i/s - 27.52x slower

push N (N = 31,623)
Rb4Heap: 5863469.5 i/s
Rb2Heap: 4416811.1 i/s - 1.33x slower
Timers gem: 3462349.2 i/s - 1.69x slower
BSearch: 563895.6 i/s - 10.40x slower

push N (N = 10)
Rb4Heap: 6312993.5 i/s
Rb2Heap: 5515562.1 i/s - 1.14x slower
BSearch: 4258213.7 i/s - 1.48x slower
Timers gem: 4226971.0 i/s - 1.49x slower

pushpop w/ size = 3,162,278
Rb4Heap: 369297.9 i/s
Rb2Heap: 337428.7 i/s - 1.09x slower
Timers gem: 255441.3 i/s - 1.45x slower
BSearch: 0.0 i/s

pushpop w/ size = 100,000
Rb4Heap: 459019.5 i/s
Rb2Heap: 434168.4 i/s - 1.06x slower
Timers gem: 290205.6 i/s - 1.58x slower
BSearch: 228682.2 i/s - 2.01x slower

pushpop w/ size = 31,623
BSearch: 956823.2 i/s
Rb4Heap: 507833.1 i/s - 1.88x slower
Rb2Heap: 456219.1 i/s - 2.10x slower
Timers gem: 310593.7 i/s - 3.08x slower

pushpop w/ size = 10
BSearch: 3613651.1 i/s
Rb4Heap: 1616519.3 i/s - 2.24x slower
Rb2Heap: 1604217.9 i/s - 2.25x slower
Timers gem: 1111453.2 i/s - 3.25x slower

push N, pop N (N = 3,162,278)
Rb4Heap: 812798.4 i/s
Rb2Heap: 745647.4 i/s - 1.09x slower
Timers gem: 553322.0 i/s - 1.47x slower
BSearch: 0.0 i/s

push N, pop N (N = 100,000)
Rb4Heap: 1108310.9 i/s
Rb2Heap: 1095127.5 i/s - 1.01x slower
Timers gem: 834546.7 i/s - 1.33x slower
BSearch: 401824.9 i/s - 2.76x slower

push N, pop N (N = 31,623)
Rb4Heap: 1261109.5 i/s
Rb2Heap: 1226813.8 i/s - 1.03x slower
BSearch: 1070724.9 i/s - 1.18x slower
Timers gem: 904562.3 i/s - 1.39x slower

push N, pop N (N = 10)
BSearch: 6166773.3 i/s
Rb4Heap: 3922461.5 i/s - 1.57x slower
Rb2Heap: 3920965.2 i/s - 1.57x slower
Timers gem: 3305228.9 i/s - 1.87x slower
```
(N values were chosen to be powers of sqrt(10).)
Great work on the benchmarks.
If we can make it an optional dependency, I'm fine with having it; it looks great. We still need to be careful about JRuby/TruffleRuby support, which is where a pure Ruby implementation is good.
I'm so sorry, I just completely fell off the map with all of this and a bunch of other little open source stuff I'd wanted to accomplish this year. I'm going to try to clean up some of my code prior to RubyConf in November. Would a heap written as a C extension (or something like it) still be useful in this gem? Or could we even push something like it into stdlib in time for 3.1?
Ah, and even if we don't go with a C extension, I just noticed (remembered) that I had a carefully optimized pure Ruby implementation, linked in the gist above. I can create a PR with just that, too. I haven't run benchmarks in JRuby or TruffleRuby yet, but I expect this is the sort of code they naturally handle very well anyway.
> I had a carefully optimized pure ruby implementation, linked in the gist above. I can create a PR with just that, too.
I think that's a good idea.