Arena's defaultSizeArenaBlock allocator under high contention on a busy server
Posting this issue for discussion so we can come up with a good remedy.
With freelists on, and under high transaction rates, a perf flame graph shows a lot of time spent in `freelist_new`/`freelist_free` under `Arena::alloc` or `Arena::reset`. This is due to the freelist implementation performing atomic CAS operations in a loop, retrying until the CAS succeeds. I wrote a benchmark to exercise this and added a bit of instrumentation to count how many times the CAS fails.
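For context, `freelist_new`/`freelist_free` are essentially a lock-free stack pop/push built on a CAS retry loop, roughly like the sketch below (a deliberate simplification: the real ink_freelist also has to guard against the ABA problem, which this naive version ignores). Every failed CAS costs another trip around the loop, and that is what the instrumentation counts.

```cpp
#include <atomic>

struct FreeNode {
  FreeNode *next;
};

// Pop one node off a CAS-guarded freelist. Each failed
// compare_exchange_weak means another thread updated the head first and
// this thread must retry -- those retries are what the loop counter in
// the benchmark below measures.
FreeNode *
freelist_pop(std::atomic<FreeNode *> &head)
{
  FreeNode *old_head = head.load(std::memory_order_acquire);
  while (old_head != nullptr &&
         !head.compare_exchange_weak(old_head, old_head->next,
                                     std::memory_order_acquire)) {
    // compare_exchange_weak reloads old_head on failure; just retry.
  }
  return old_head; // nullptr => freelist empty, fall back to a real allocation
}

// Push a node back onto the freelist. Same pattern: loop until our CAS wins.
void
freelist_push(std::atomic<FreeNode *> &head, FreeNode *node)
{
  node->next = head.load(std::memory_order_relaxed);
  while (!head.compare_exchange_weak(node->next, node,
                                     std::memory_order_release)) {
    // On failure, node->next holds the freshly-read head, so retrying is cheap.
  }
}
```

Benchmark output: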
```
benchmark name           samples       iterations    estimated
                         mean          low mean      high mean
                         std dev       low std dev   high std dev
-------------------------------------------------------------------------------
global allocator         100           1             2.17617 s
                         18.6112 ms    17.8946 ms    19.1444 ms
                         3.13175 ms    2.48935 ms    4.16424 ms

thread allocator         100           1             515.923 ms
                         5.09905 ms    4.97022 ms    5.34322 ms
                         872.118 us    562.584 us    1.52475 ms

max global loop count: 226979
max local loop count: 0
```
The difference between these benchmarks is that `global allocator` is marked `static` (as the ArenaBlock allocator is in ATS), while `thread allocator` is marked `thread_local`. Otherwise the code is identical.
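In code terms, the only difference between the two variants is the storage class of the allocator instance (hypothetical type and names, mirroring the benchmark's setup):

```cpp
#include <atomic>

struct FreelistAllocator {
  std::atomic<void *> head{nullptr}; // CAS'd freelist head, as sketched above
  // alloc()/dealloc() elided; both are CAS retry loops against `head`
};

// "global allocator" variant: one instance shared by all threads,
// matching how defaultSizeArenaBlock is declared in ATS today.
static FreelistAllocator global_allocator;

// "thread allocator" variant: one instance per thread, so the CAS never races.
thread_local FreelistAllocator thread_allocator;
```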
The benchmark shows that the `thread_local` version is much faster, but the loop counter I added is also telling: it tracks how many times the CAS fails when allocating/freeing from the freelist. The max global loop count above is the worst number of CAS misses across all of the benchmark iterations. For the thread-local allocator this is 0, because there is no contention and the CAS never fails. For the global allocator, though, 20 threads are vying to allocate and free 1000 items each from the same allocator instance. Doing the math, that should take 20 * 1000 * 2 = 40000 CAS attempts, but there were 226979 misses, for a total of 266979 CAS operations, or about 7 tries to accomplish one successful CAS. In other words, the thread-local allocator is about 7x better in this scenario.
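For anyone who wants to reproduce the counting, the instrumentation amounts to bumping a counter every time `compare_exchange_weak` fails. A minimal self-contained sketch (hypothetical names, with a plain atomic counter standing in for the freelist head):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreads        = 20;   // threads vying for the shared allocator
constexpr int kItemsPerThread = 1000; // alloc/free pairs per thread

std::atomic<long> g_head{0};       // stand-in for the CAS'd freelist head
std::atomic<long> g_cas_misses{0}; // failed CAS attempts, summed over all threads

void
worker()
{
  for (int i = 0; i < kItemsPerThread; ++i) {
    // Two CAS operations per item (one "alloc", one "free"), matching the
    // 20 * 1000 * 2 = 40000 expected attempts in the analysis above.
    for (int op = 0; op < 2; ++op) {
      long expected = g_head.load(std::memory_order_relaxed);
      while (!g_head.compare_exchange_weak(expected, expected + 1,
                                           std::memory_order_acq_rel)) {
        g_cas_misses.fetch_add(1, std::memory_order_relaxed);
      }
    }
  }
}

int
main()
{
  std::vector<std::thread> threads;
  for (int t = 0; t < kThreads; ++t) {
    threads.emplace_back(worker);
  }
  for (auto &th : threads) {
    th.join();
  }
  // Declaring g_head thread_local instead would drop the miss count to 0,
  // since each thread would only ever race with itself.
  std::printf("CAS misses: %ld\n", g_cas_misses.load());
}
```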
We tested marking this `defaultSizeArenaBlock` allocator `thread_local` and there was a significant performance increase.
I see a couple of discussion items around this:
- With freelists off, this allocator uses system malloc (or a thread-local jemalloc heap) for allocation, so in some situations marking this allocator `thread_local` ends up stacking one thread-local mechanism on top of another, which seems odd.
- Normally this might be a use case for a `ProxyAllocator`, but code organization-wise, `Arena` is in `tscore` while `ProxyAllocator` is in `iocore/eventsystem`, so the dependency would go the wrong direction.
The most direct solution is just to mark this `defaultSizeArenaBlock` allocator `thread_local` (sketched below), but I'd like to get some community feedback on this issue.
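For concreteness, the change being proposed is on the order of the following (a sketch; the exact declaration and constructor arguments in `src/tscore/Arena.cc` may differ):

```cpp
// src/tscore/Arena.cc (sketch)

// Today: one allocator instance shared by every thread, so all ArenaBlock
// freelist traffic funnels through a single CAS'd head pointer.
//   static Allocator defaultSizeArenaBlock("ArenaBlock", DEFAULT_BLOCK_SIZE);

// Proposed: one instance per thread, so Arena::alloc/Arena::reset never
// contend and the freelist CAS succeeds on the first try.
thread_local Allocator defaultSizeArenaBlock("ArenaBlock", DEFAULT_BLOCK_SIZE);
```

One tradeoff to weigh: each thread would keep its own freelist, so memory cached by one thread is not reusable by another.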
Thanks for filing this @cmcfarlen:
This really is worth the community's time to find a way forward, as initial testing of the prototype showed a staggering increase in transaction rate. I was only able to test one system type in the lab, but I do plan to do more.
- ~545,000 RPS (9.2.x as-is)
- ~720,000 RPS (thread_local prototype)
After working together on testing https://github.com/apache/trafficserver/pull/8805, it quickly became apparent that this was the next hot path.
If we're looking at different possible implementations for the arena, this is an option: https://github.com/wkaras/heap-memory-manager
This issue has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.