
[lockfree-queue] Pad byte array with two cachelines to avoid false sharing in AtomicCounter

Open · jberryman opened this issue 11 years ago · 1 comment

I played with a branch (jberryman@23497b4cab6b714fdceebab16cf9e372c5da2c7a) that implements the atomic counter with 64 bytes of padding on either side, so that it always sits in its own cache line (hopefully, at least on x86, where AtomicCounter is most useful anyway).
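Roughly, the idea looks like the following. This is a minimal sketch rather than the actual patch on the branch above; it assumes a 64-bit platform (8-byte `Int`s) and 64-byte cache lines, and the names (`PaddedCounter`, `newPaddedCounter`, `incrPaddedCounter`) are made up for illustration:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module PaddedCounter (PaddedCounter, newPaddedCounter, incrPaddedCounter) where

import GHC.Exts
import GHC.IO (IO(..))

-- The counter word lives one full cache line into a (64 + 8 + 64)-byte
-- array, so no other heap object's data can share its cache line.
data PaddedCounter = PaddedCounter (MutableByteArray# RealWorld)

newPaddedCounter :: Int -> IO PaddedCounter
newPaddedCounter (I# n) = IO $ \s0 ->
  case newByteArray# (64# +# 8# +# 64#) s0 of
    (# s1, mba #) ->
      -- byte offset 64 = Int index 8 on a 64-bit platform
      case writeIntArray# mba 8# n s1 of
        s2 -> (# s2, PaddedCounter mba #)

-- Atomically add i to the counter. On recent GHCs fetchAddIntArray#
-- returns the value held just before the addition.
incrPaddedCounter :: Int -> PaddedCounter -> IO Int
incrPaddedCounter (I# i) (PaddedCounter mba) = IO $ \s0 ->
  case fetchAddIntArray# mba 8# i s0 of
    (# s1, old #) -> (# s1, I# old #)
```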

In criterion benchmarks of a counter-based queue I'm working on, I get a 17–23% improvement in runtime on those benchmarks where we start to use the heap (maxing out at 1,000,000 Ints and 1,000 arrays of size 1,000... whatever that amounts to).

I modified the test slightly to run under perf, and some of the cache metrics do appear to be better (though others are a bit worse; maybe that's just expected variance).

Here's the slow version, with the current counter:

 Performance counter stats for './multi_cache_slow' (100 runs):

        401.063542      task-clock (msec)         #    3.745 CPUs utilized            ( +-  0.64% )
               300      context-switches          #    0.747 K/sec                    ( +- 27.94% )
                20      cpu-migrations            #    0.049 K/sec                    ( +-  1.70% )
             2,999      page-faults               #    0.007 M/sec                    ( +-  0.01% )
       726,617,399      cycles                    #    1.812 GHz                      ( +-  1.11% ) [24.21%]
       623,872,915      stalled-cycles-frontend   #   85.86% frontend cycles idle     ( +-  0.94% ) [28.13%]
       427,867,935      instructions              #    0.59  insns per cycle        
                                                  #    1.46  stalled cycles per insn  ( +-  1.83% ) [37.28%]
           268,108      cache-misses              #    5.209 % of all cache refs      ( +-  0.99% ) [40.43%]
         5,146,579      cache-references          #   12.832 M/sec                    ( +-  1.08% ) [43.64%]
         5,616,736      L1-dcache-load-misses     #   14.005 M/sec                    ( +-  1.07% ) [43.16%]
           497,307      L1-dcache-store-misses    #    1.240 M/sec                    ( +-  1.03% ) [40.69%]
         1,470,803      L1-dcache-prefetch-misses #    3.667 M/sec                    ( +-  2.46% ) [29.46%]
           102,164      L1-icache-load-misses     #    0.255 M/sec                    ( +-  1.71% ) [27.00%]
         1,939,392      LLC-loads                 #    4.836 M/sec                    ( +-  1.74% ) [24.80%]
         2,696,163      LLC-stores                #    6.723 M/sec                    ( +-  1.81% ) [23.33%]
         6,026,990      LLC-prefetches            #   15.028 M/sec                    ( +-  2.52% ) [10.68%]
            12,353      iTLB-loads                #    0.031 M/sec                    ( +- 10.40% ) [25.89%]
             9,746      iTLB-load-misses          #   78.89% of all iTLB cache hits   ( +-  2.94% ) [25.01%]
        88,551,617      branch-loads              #  220.792 M/sec                    ( +-  2.43% ) [23.60%]
            89,617      branch-load-misses        #    0.223 M/sec                    ( +-  6.05% ) [22.22%]

       0.107099418 seconds time elapsed                                          ( +-  0.72% )

And the fast version (I've starred the metrics that improved):

 Performance counter stats for './multi_cache_fast' (100 runs):

        333.044322      task-clock (msec)         #    3.701 CPUs utilized            ( +-  0.81% )
               221      context-switches          #    0.663 K/sec                    ( +- 16.25% )
                20      cpu-migrations            #    0.059 K/sec                    ( +-  1.70% )
             2,999      page-faults               #    0.009 M/sec                    ( +-  0.02% )
       748,219,452      cycles                    #    2.247 GHz                      ( +-  0.96% ) [23.82%]
       634,355,295      stalled-cycles-frontend   #   84.78% frontend cycles idle     ( +-  1.08% ) [27.57%]
       471,072,726      instructions              #    0.63  insns per cycle        
                                                  #    1.35  stalled cycles per insn  ( +-  1.15% ) [35.37%]
           301,361      cache-misses              #    6.418 % of all cache refs      ( +-  1.59% ) [37.17%]
*        4,695,679      cache-references          #   14.099 M/sec                    ( +-  1.19% ) [38.81%]
*        4,828,687      L1-dcache-load-misses     #   14.499 M/sec                    ( +-  1.38% ) [37.72%]
           654,043      L1-dcache-store-misses    #    1.964 M/sec                    ( +-  1.89% ) [35.95%]
*          592,677      L1-dcache-prefetch-misses #    1.780 M/sec                    ( +-  2.59% ) [29.03%]
*           66,387      L1-icache-load-misses     #    0.199 M/sec                    ( +-  3.89% ) [34.49%]
*          573,885      LLC-loads                 #    1.723 M/sec                    ( +-  4.69% ) [34.30%]
*        1,865,970      LLC-stores                #    5.603 M/sec                    ( +-  2.98% ) [32.93%]
         4,460,103      LLC-prefetches            #   13.392 M/sec                    ( +-  1.99% ) [15.90%]
            16,509      iTLB-loads                #    0.050 M/sec                    ( +-  9.41% ) [27.67%]
*            7,565      iTLB-load-misses          #   45.82% of all iTLB cache hits   ( +-  1.98% ) [25.46%]
        97,310,210      branch-loads              #  292.184 M/sec                    ( +-  2.42% ) [23.51%]
*           74,815      branch-load-misses        #    0.225 M/sec                    ( +-  7.71% ) [21.80%]

       0.089992717 seconds time elapsed                                          ( +-  0.91% )

I also have a version that uses a pinned, aligned MutableByteArray and gets the same benefits with just 64 bytes, but I think that would be a mistake: it could contribute to memory fragmentation.
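For reference, here's a sketch of what I mean by the pinned, aligned version (again with made-up names; `newAlignedPinnedByteArray#` takes a size and an alignment, both 64 here). Pinned allocations are never moved by GHC's copying collector, which is where the fragmentation worry comes from:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module AlignedCounter where

import GHC.Exts
import GHC.IO (IO(..))

-- A single pinned cache line, aligned to 64 bytes: the counter can
-- neither straddle a line nor share one with another heap object.
data AlignedCounter = AlignedCounter (MutableByteArray# RealWorld)

newAlignedCounter :: Int -> IO AlignedCounter
newAlignedCounter (I# n) = IO $ \s0 ->
  case newAlignedPinnedByteArray# 64# 64# s0 of  -- 64 bytes, 64-byte aligned
    (# s1, mba #) ->
      case writeIntArray# mba 0# n s1 of         -- counter at offset 0
        s2 -> (# s2, AlignedCounter mba #)

incrAlignedCounter :: Int -> AlignedCounter -> IO Int
incrAlignedCounter (I# i) (AlignedCounter mba) = IO $ \s0 ->
  case fetchAddIntArray# mba 0# i s0 of
    (# s1, old #) -> (# s1, I# old #)
```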

If my benchmarks aren't deceiving me, I think this change would probably be worth the 127 wasted bytes for most uses of AtomicCounter that I can think of. Any interest in this sort of change? If not, it's no trouble for me to create my own fat counter.

jberryman · May 29 '14 16:05

So, actually, if I double the numbers in my patch (using an array of 256 bytes) I seem to get very slightly better benchmark results, suggesting that something else (or something in addition) may be going on. I'd be interested to see whether this improves things for anyone else using AtomicCounter. I need to move on for now.
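In case anyone wants to experiment, here's a hypothetical parameterized variant of the sketch from my first comment, where the per-side padding is an argument, so the 64-byte and 128-byte versions are the same code:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module PaddedCounterParam where

import GHC.Exts
import GHC.IO (IO(..))

-- Carries the Int index of the counter slot so the padding can vary.
data PaddedCounter' = PaddedCounter' Int# (MutableByteArray# RealWorld)

-- pad = bytes on either side of the counter: 64 in my patch,
-- 128 for the 256-byte variant above. Assumes 8-byte Ints.
newPaddedCounter' :: Int -> Int -> IO PaddedCounter'
newPaddedCounter' (I# pad) (I# n) = IO $ \s0 ->
  case newByteArray# (2# *# pad +# 8#) s0 of
    (# s1, mba #) ->
      case quotInt# pad 8# of          -- Int index of the counter slot
        ix -> case writeIntArray# mba ix n s1 of
          s2 -> (# s2, PaddedCounter' ix mba #)

incrPaddedCounter' :: Int -> PaddedCounter' -> IO Int
incrPaddedCounter' (I# i) (PaddedCounter' ix mba) = IO $ \s0 ->
  case fetchAddIntArray# mba ix i s0 of
    (# s1, old #) -> (# s1, I# old #)
```

`newPaddedCounter' 64` reproduces the original patch; `newPaddedCounter' 128` is the doubled variant.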

jberryman · May 29 '14 22:05