ekg
Imprecise counters to avoid issues with contended `atomicModifyIORef`
[Since @tibbe wanted to be kept in the loop.] We're currently testing whether these ideas are ready for our production system; if they are, we'll integrate them back upstream. Here's a quick summary of what we're doing:
Problem: We use a lot of EKG counters to keep track of various code paths in our system. Each counter is incremented using `atomicModifyIORef`, and our binaries usually run with 4 or more hardware threads. We noticed on microbenchmarks that these counters add non-trivial overhead, most likely because of cache-line contention and the busy-waiting loop used by `atomicModifyIORef`. We could use the non-atomic `modifyIORef` instead, but that would likely get less and less precise (due to lost updates) the more cores are involved.
Proposed Solution: The basic idea is to use one counter per capability and sum them when reading the current value. This is still a bit imprecise, but probably much less so than the `IORef` approach. A counter is represented as a byte array, and each capability writes to a different part of that array. These writes don't have to be atomic. For a slight improvement in performance we also make sure that two cores don't share the same cache line; the remaining parts of each cache line can be used for other counters. So the array looks as follows:
|   capability 0    |   capability 1    |   capability 2    |
+-----------------------------------------------------------+
| c1 | c2 | c3 | c4 | c1 | c2 | c3 | c4 | c1 | c2 | c3 | c4 |
+-----------------------------------------------------------+
|<---- stride ----->|
The per-capability counters for counter `c1` are at offsets `0`, `1 * stride + 0`, and `2 * stride + 0`. The current value of the counter is the sum of all of these per-capability counters. Since we cannot read all the values at the same time, we have a race condition, but that's why they're imprecise counters.
Due to this imprecision they should be displayed as imprecise counters in the UI as well.
Have a look at https://ghc.haskell.org/trac/ghc/ticket/8972#comment:1. It looks like contended `IORef`s scale very poorly. I will try to use the C implementation I gave in that ticket in ekg. It should make things faster, possibly by orders of magnitude.
I can't post on Trac for some reason right now (the spam filter dislikes me), but related tickets that complement the unboxed-reference idea are https://ghc.haskell.org/trac/ghc/ticket/8157 and https://ghc.haskell.org/trac/ghc/ticket/7883#comment:16.
`atomicModifyIORef` has a whole busy-wait loop structure that probably adds a bunch of overhead too.
The fetch-and-add primop exposed in atomic-primops tends to perform very well (on x86) if you want a slightly simpler solution. Be sure to make your own counter that's padded to its own cache line.
Yeah, cache-line-width padding is a VERY important detail.