
Imprecise counters to avoid issues with contended `atomicModifyIORef`

Open nominolo opened this issue 10 years ago • 5 comments

[Since @tibbe wanted to be kept in the loop.] We're currently testing whether these ideas are ready for our production system, and will then integrate them back upstream. So, here's a quick summary of what we're doing:

Problem: We use a lot of EKG counters to keep track of various code paths in our system. Each counter is incremented using `atomicModifyIORef`, and our binaries usually run with 4 or more hardware threads. We noticed in microbenchmarks that these counters add non-trivial overhead, most likely because of heavy cache-line contention and the busy-waiting loop used by `atomicModifyIORef`. We could use `modifyIORef` instead, but the counts would likely get less and less precise as more cores are involved.
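For concreteness, here is roughly the pattern in question (a minimal sketch, not our actual code; it uses the `async` package just to spawn the threads, and the strict `atomicModifyIORef'` to avoid thunk buildup):

```haskell
import Control.Concurrent.Async (replicateConcurrently_)
import Control.Monad (replicateM_)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)

-- Build with -threaded, run with +RTS -N4.
main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  -- Four threads hammering one IORef: every increment runs a CAS loop
  -- against the same cache line, so the threads serialize on it.
  replicateConcurrently_ 4 $
    replicateM_ 1000000 $
      atomicModifyIORef' ref (\n -> (n + 1, ()))
  print =<< readIORef ref
```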

Proposed Solution: The basic idea is to use one counter per capability and then sum them up when reading the current value. This is still a bit imprecise, but probably much less so than the IORef approach. So, you represent a counter as a byte array, and each capability writes to a different part of that byte array. These writes don't have to be atomic. For a slight performance improvement we also make sure that two cores never share the same cache line; the remaining parts of each cache line can be used for other counters. So the array looks as follows:

```
| capability 0      | capability 1      | capability 2      |
+-----------------------------------------------------------+
| c1 | c2 | c3 | c4 | c1 | c2 | c3 | c4 | c1 | c2 | c3 | c4 |
+-----------------------------------------------------------+
|<----- stride ---->|
```

The per-capability counters for counter c1 live at offsets 0, 1 * stride + 0, 2 * stride + 0, and so on. The current value of the counter is the sum of all of these per-capability counters. Since we cannot read all of them at the same instant, we have a race condition, but that's why they're imprecise counters.
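A minimal sketch of this scheme using the `primitive` package, assuming 64-byte cache lines and 8-byte `Int` slots (the names `Striped`, `inc`, and `readApprox` are illustrative, not the eventual ekg API):

```haskell
import Control.Concurrent (getNumCapabilities, myThreadId, threadCapability)
import Control.Monad (foldM)
import Control.Monad.Primitive (RealWorld)
import Data.Primitive.ByteArray
  (MutableByteArray, fillByteArray, newAlignedPinnedByteArray,
   readByteArray, writeByteArray)

-- Assumed geometry: 64-byte cache lines and 8-byte Int slots, so each
-- capability's cache line holds up to 8 counters (c1..c8).
cacheLine, slotsPerLine :: Int
cacheLine    = 64
slotsPerLine = cacheLine `div` 8

data Striped = Striped !(MutableByteArray RealWorld) !Int  -- array, #capabilities

newStriped :: IO Striped
newStriped = do
  caps <- getNumCapabilities
  arr  <- newAlignedPinnedByteArray (caps * cacheLine) cacheLine
  fillByteArray arr 0 (caps * cacheLine) 0
  pure (Striped arr caps)

-- Plain, non-atomic read-modify-write of slot c on the current
-- capability's cache line.
inc :: Striped -> Int -> IO ()
inc (Striped arr _) c = do
  (cap, _) <- threadCapability =<< myThreadId
  let i = cap * slotsPerLine + c
  v <- readByteArray arr i :: IO Int
  writeByteArray arr i (v + 1)

-- Racy read: sum slot c across all capabilities. In-flight increments
-- may be missed, which is why these are "imprecise" counters.
readApprox :: Striped -> Int -> IO Int
readApprox (Striped arr caps) c =
  foldM
    (\acc cap -> do
        v <- readByteArray arr (cap * slotsPerLine + c)
        pure (acc + (v :: Int)))
    0
    [0 .. caps - 1]
```

Note that an unpinned thread can also migrate to another capability between the read and the write in `inc`, losing an increment; that is part of the imprecision being accepted here.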

Due to this imprecision they should be displayed as imprecise counters in the UI as well.

nominolo avatar Mar 07 '14 22:03 nominolo

Have a look at https://ghc.haskell.org/trac/ghc/ticket/8972#comment:1. It looks like contended IORefs scale very poorly. I will try to use the C implementation I gave in that ticket in ekg; it should make things faster, potentially by orders of magnitude.
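For reference, binding such a C fetch-and-add into Haskell via the FFI might look something like the following (a hypothetical sketch, not the exact code from the ticket; the C function name is made up):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CLLong (..))
import Foreign.Ptr (Ptr)

-- In an accompanying C file, something like:
--   long long hs_counter_add(volatile long long *p, long long n)
--   { return __sync_add_and_fetch(p, n); }

-- Atomically add n to the counter at p, returning the new value.
-- A single locked add instruction on x86; no CAS retry loop.
foreign import ccall unsafe "hs_counter_add"
  counterAdd :: Ptr CLLong -> CLLong -> IO CLLong
```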

tibbe avatar Apr 08 '14 13:04 tibbe

Can't post on Trac for some reason right now (the spam filter dislikes me), but related tickets that complement the unboxed-reference idea are https://ghc.haskell.org/trac/ghc/ticket/8157 and https://ghc.haskell.org/trac/ghc/ticket/7883#comment:16

cartazio avatar Apr 08 '14 14:04 cartazio

atomicModifyIORef has a whole busy-wait loop structure that probably adds a bunch of overhead too

cartazio avatar Apr 08 '14 14:04 cartazio

The fetch-and-add primop exposed in atomic-primops tends to perform very well (on x86) if you want a slightly simpler solution. Be sure to make your own counter that's padded onto its own cache line.
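A minimal sketch of such a padded counter, assuming atomic-primops' `fetchAddIntArray` and a 64-byte cache line (`PaddedCounter` and the helper names are illustrative):

```haskell
import Control.Monad (void)
import Control.Monad.Primitive (RealWorld)
import Data.Atomics (fetchAddIntArray)
import Data.Primitive.ByteArray
  (MutableByteArray, newAlignedPinnedByteArray, readByteArray, writeByteArray)

-- One counter padded out to an assumed 64-byte cache line so that no
-- two counters ever share a line.
newtype PaddedCounter = PaddedCounter (MutableByteArray RealWorld)

newPaddedCounter :: IO PaddedCounter
newPaddedCounter = do
  arr <- newAlignedPinnedByteArray 64 64
  writeByteArray arr 0 (0 :: Int)  -- slot 0 holds the count; the rest is padding
  pure (PaddedCounter arr)

-- On x86 this compiles down to a single `lock xadd`; no CAS retry loop.
incr :: PaddedCounter -> IO ()
incr (PaddedCounter arr) = void (fetchAddIntArray arr 0 1)

readPadded :: PaddedCounter -> IO Int
readPadded (PaddedCounter arr) = readByteArray arr 0
```

Unlike the per-capability scheme above, this stays exact: every increment is atomic, at the cost of all cores still touching the same cache line for a given counter.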

jberryman avatar Mar 10 '16 17:03 jberryman

Yeah, cache-line-width padding is a VERY important detail

cartazio avatar Mar 11 '16 03:03 cartazio