[lockfree-queue] Pad byte array with two cache lines to avoid false sharing in AtomicCounter
I played with a branch, jberryman@23497b4cab6b714fdceebab16cf9e372c5da2c7a, that implements the atomic counter with 64 bytes of padding on either side, so that it always sits in its own cache line (at least hopefully on x86, where AtomicCounter is most useful anyway).
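For concreteness, the idea is roughly the following (just a sketch, not the actual code on that branch; PaddedCounter, newPaddedCounter, and incrPadded are made-up names, and it assumes fetchAddIntArray from atomic-primops' Data.Atomics, 64-bit Ints, and 64-byte cache lines):

```haskell
-- Sketch only: a counter whose Int lives one cache line into a
-- 64 + 8 + 64 byte MutableByteArray, so no neighboring heap object
-- should share its cache line.
import Data.Primitive.ByteArray (MutableByteArray, newByteArray, writeByteArray)
import Data.Atomics (fetchAddIntArray)   -- from atomic-primops
import Control.Monad.Primitive (RealWorld)

cacheLine :: Int
cacheLine = 64                  -- assumed x86 cache-line size in bytes

wordIx :: Int
wordIx = cacheLine `div` 8      -- Int-sized slot sitting one cache line in

newtype PaddedCounter = PaddedCounter (MutableByteArray RealWorld)

newPaddedCounter :: Int -> IO PaddedCounter
newPaddedCounter n = do
  arr <- newByteArray (2 * cacheLine + 8)   -- padding, counter word, padding
  writeByteArray arr wordIx n               -- initial value at byte offset 64
  return (PaddedCounter arr)

-- Atomic fetch-and-add on the padded slot; whether the old or new value
-- comes back depends on the atomic-primops version.
incrPadded :: PaddedCounter -> IO Int
incrPadded (PaddedCounter arr) = fetchAddIntArray arr wordIx 1
```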
In criterion benchmarks of a counter-based queue I'm working on, I get a 17–23% improvement in runtime on the benchmarks that start to use the heap (maxing out at 1,000,000 Ints and 1,000 arrays of size 1,000... whatever that amounts to).
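The benchmarks are along these lines (an illustrative harness only, not the actual queue benchmark; the thread and iteration counts here are made up):

```haskell
import Criterion.Main (defaultMain, bench, whnfIO)
import Control.Concurrent.Async (mapConcurrently_)
import Control.Monad (replicateM_)
import Data.Atomics.Counter (newCounter, incrCounter)

-- Hammer a shared AtomicCounter from several threads and let criterion
-- time the whole run; compile with -threaded and run with +RTS -N for
-- the contention (and any false sharing) to actually show up.
main :: IO ()
main = defaultMain
  [ bench "incrCounter, 4 threads x 100k" $ whnfIO $ do
      c <- newCounter 0
      mapConcurrently_ (\_ -> replicateM_ 100000 (incrCounter 1 c)) [1 .. 4 :: Int]
  ]
```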
I modified the test slightly to run with perf, and some of the cache metrics do appear to be better (though others are a bit worse; maybe that's just expected variance).
Here's the slow version, with the current counter:
```
Performance counter stats for './multi_cache_slow' (100 runs):
401.063542 task-clock (msec) # 3.745 CPUs utilized ( +- 0.64% )
300 context-switches # 0.747 K/sec ( +- 27.94% )
20 cpu-migrations # 0.049 K/sec ( +- 1.70% )
2,999 page-faults # 0.007 M/sec ( +- 0.01% )
726,617,399 cycles # 1.812 GHz ( +- 1.11% ) [24.21%]
623,872,915 stalled-cycles-frontend # 85.86% frontend cycles idle ( +- 0.94% ) [28.13%]
427,867,935 instructions # 0.59 insns per cycle
# 1.46 stalled cycles per insn ( +- 1.83% ) [37.28%]
268,108 cache-misses # 5.209 % of all cache refs ( +- 0.99% ) [40.43%]
5,146,579 cache-references # 12.832 M/sec ( +- 1.08% ) [43.64%]
5,616,736 L1-dcache-load-misses # 14.005 M/sec ( +- 1.07% ) [43.16%]
497,307 L1-dcache-store-misses # 1.240 M/sec ( +- 1.03% ) [40.69%]
1,470,803 L1-dcache-prefetch-misses # 3.667 M/sec ( +- 2.46% ) [29.46%]
102,164 L1-icache-load-misses # 0.255 M/sec ( +- 1.71% ) [27.00%]
1,939,392 LLC-loads # 4.836 M/sec ( +- 1.74% ) [24.80%]
2,696,163 LLC-stores # 6.723 M/sec ( +- 1.81% ) [23.33%]
6,026,990 LLC-prefetches # 15.028 M/sec ( +- 2.52% ) [10.68%]
12,353 iTLB-loads # 0.031 M/sec ( +- 10.40% ) [25.89%]
9,746 iTLB-load-misses # 78.89% of all iTLB cache hits ( +- 2.94% ) [25.01%]
88,551,617 branch-loads # 220.792 M/sec ( +- 2.43% ) [23.60%]
89,617 branch-load-misses # 0.223 M/sec ( +- 6.05% ) [22.22%]
0.107099418 seconds time elapsed ( +- 0.72% )
```
And the fast version:
```
Performance counter stats for './multi_cache_fast' (100 runs):
333.044322 task-clock (msec) # 3.701 CPUs utilized ( +- 0.81% )
221 context-switches # 0.663 K/sec ( +- 16.25% )
20 cpu-migrations # 0.059 K/sec ( +- 1.70% )
2,999 page-faults # 0.009 M/sec ( +- 0.02% )
748,219,452 cycles # 2.247 GHz ( +- 0.96% ) [23.82%]
634,355,295 stalled-cycles-frontend # 84.78% frontend cycles idle ( +- 1.08% ) [27.57%]
471,072,726 instructions # 0.63 insns per cycle
# 1.35 stalled cycles per insn ( +- 1.15% ) [35.37%]
301,361 cache-misses # 6.418 % of all cache refs ( +- 1.59% ) [37.17%]
* 4,695,679 cache-references # 14.099 M/sec ( +- 1.19% ) [38.81%]
* 4,828,687 L1-dcache-load-misses # 14.499 M/sec ( +- 1.38% ) [37.72%]
654,043 L1-dcache-store-misses # 1.964 M/sec ( +- 1.89% ) [35.95%]
* 592,677 L1-dcache-prefetch-misses # 1.780 M/sec ( +- 2.59% ) [29.03%]
* 66,387 L1-icache-load-misses # 0.199 M/sec ( +- 3.89% ) [34.49%]
* 573,885 LLC-loads # 1.723 M/sec ( +- 4.69% ) [34.30%]
* 1,865,970 LLC-stores # 5.603 M/sec ( +- 2.98% ) [32.93%]
4,460,103 LLC-prefetches # 13.392 M/sec ( +- 1.99% ) [15.90%]
16,509 iTLB-loads # 0.050 M/sec ( +- 9.41% ) [27.67%]
* 7,565 iTLB-load-misses # 45.82% of all iTLB cache hits ( +- 1.98% ) [25.46%]
97,310,210 branch-loads # 292.184 M/sec ( +- 2.42% ) [23.51%]
* 74,815 branch-load-misses # 0.225 M/sec ( +- 7.71% ) [21.80%]
0.089992717 seconds time elapsed ( +- 0.91% )
```
I also have a version that uses a pinned, aligned MutableByteArray and gets the same benefits with just 64 bytes, but I think that approach would be a mistake and could contribute to memory fragmentation.
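That variant is roughly this shape (again only a sketch; newPinnedCounter is a made-up name). The pinning is exactly what worries me: pinned objects can't be moved by GHC's copying collector, which is where the fragmentation risk comes from.

```haskell
import Data.Primitive.ByteArray
  (MutableByteArray, newAlignedPinnedByteArray, writeByteArray)
import Control.Monad.Primitive (RealWorld)

-- A single pinned, 64-byte-aligned, 64-byte array: the counter word owns
-- exactly one cache line, with no padding wasted on either side.
newPinnedCounter :: Int -> IO (MutableByteArray RealWorld)
newPinnedCounter n = do
  arr <- newAlignedPinnedByteArray 64 64   -- size in bytes, alignment in bytes
  writeByteArray arr 0 (n :: Int)          -- counter at word index 0
  return arr
```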
If my benchmarks aren't deceiving me, I think this change would probably be worth the 127 wasted bytes for most uses of AtomicCounter that I can think of. Any interest in this sort of change? If not, it's no trouble for me to create my own fat counter.
Actually, if I double the numbers in my patch (using an array of 256 bytes) I seem to get slightly better benchmark results still, which suggests that something else (or something in addition to false sharing) may be going on. I'd be interested to see whether this improves things for anyone else using AtomicCounter. I need to move on for now.