Arena's defaultSizeArenaBlock allocator under high contention on a busy server
Posting this issue for discussion so we can come up with a good remedy.
With freelists on, and under high transaction rates, a perf flame graph shows a lot of time spent in `freelist_new`/`freelist_free` under `Arena::alloc` or `Arena::reset`. This is due to the freelist implementation performing atomic CAS operations in a loop, retrying until the CAS succeeds. I wrote a benchmark to exercise this and added a bit of instrumentation to count how many times the CAS fails.
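For context, `freelist_new`/`freelist_free` are essentially a lock-free stack pop/push built on a CAS retry loop, roughly like the sketch below (a deliberate simplification: the real ink_freelist also has to guard against the ABA problem, which this naive version ignores). Every failed CAS costs another trip around the loop, and that is what the instrumentation counts.

```cpp
#include <atomic>

struct FreeNode {
  FreeNode *next;
};

// Pop one node off a CAS-guarded freelist. Each failed
// compare_exchange_weak means another thread updated the head first and
// this thread must retry -- those retries are what the loop counter in
// the benchmark below measures.
FreeNode *
freelist_pop(std::atomic<FreeNode *> &head)
{
  FreeNode *old_head = head.load(std::memory_order_acquire);
  while (old_head != nullptr &&
         !head.compare_exchange_weak(old_head, old_head->next,
                                     std::memory_order_acquire)) {
    // compare_exchange_weak reloads old_head on failure; just retry.
  }
  return old_head; // nullptr => freelist empty, fall back to a real allocation
}

// Push a node back onto the freelist. Same pattern: loop until our CAS wins.
void
freelist_push(std::atomic<FreeNode *> &head, FreeNode *node)
{
  node->next = head.load(std::memory_order_relaxed);
  while (!head.compare_exchange_weak(node->next, node,
                                     std::memory_order_release)) {
    // On failure, node->next holds the freshly-read head, so retrying is cheap.
  }
}
```

Benchmark output: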
```
benchmark name           samples       iterations    estimated
                         mean          low mean      high mean
                         std dev       low std dev   high std dev
-------------------------------------------------------------------------------
global allocator         100           1             2.17617 s
                         18.6112 ms    17.8946 ms    19.1444 ms
                         3.13175 ms    2.48935 ms    4.16424 ms

thread allocator         100           1             515.923 ms
                         5.09905 ms    4.97022 ms    5.34322 ms
                         872.118 us    562.584 us    1.52475 ms

max global loop count: 226979
max local loop count: 0
```
The difference between these benchmarks is that `global allocator` is marked `static` (as the ArenaBlock allocator is in ATS), while `thread allocator` is marked `thread_local`. Otherwise the code is identical.
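In code terms, the only difference between the two variants is the storage class of the allocator instance (hypothetical type and names, mirroring the benchmark's setup):

```cpp
#include <atomic>

struct FreelistAllocator {
  std::atomic<void *> head{nullptr}; // CAS'd freelist head, as sketched above
  // alloc()/dealloc() elided; both are CAS retry loops against `head`
};

// "global allocator" variant: one instance shared by all threads,
// matching how defaultSizeArenaBlock is declared in ATS today.
static FreelistAllocator global_allocator;

// "thread allocator" variant: one instance per thread, so the CAS never races.
thread_local FreelistAllocator thread_allocator;
```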
The benchmark shows that the `thread_local` version is much faster, but the loop counter I added is also telling: it tracks how many times the CAS fails when allocating/freeing from the freelist. The max global loop count above is the worst number of CAS misses across all of the benchmark iterations. For the thread-local allocator this is 0, because there is no contention and the CAS never fails. For the global allocator, though, 20 threads are vying to allocate and free 1000 items each from the same allocator instance. Doing the math, that should take 20 * 1000 * 2 = 40000 CAS attempts, but there were 226979 misses, for a total of 266979 CAS operations, or about 7 tries to accomplish one successful CAS. In other words, the thread-local allocator is about 7x better in this scenario.
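For anyone who wants to reproduce the counting, the instrumentation amounts to bumping a counter every time `compare_exchange_weak` fails. A minimal self-contained sketch (hypothetical names, with a plain atomic counter standing in for the freelist head):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreads        = 20;   // threads vying for the shared allocator
constexpr int kItemsPerThread = 1000; // alloc/free pairs per thread

std::atomic<long> g_head{0};       // stand-in for the CAS'd freelist head
std::atomic<long> g_cas_misses{0}; // failed CAS attempts, summed over all threads

void
worker()
{
  for (int i = 0; i < kItemsPerThread; ++i) {
    // Two CAS operations per item (one "alloc", one "free"), matching the
    // 20 * 1000 * 2 = 40000 expected attempts in the analysis above.
    for (int op = 0; op < 2; ++op) {
      long expected = g_head.load(std::memory_order_relaxed);
      while (!g_head.compare_exchange_weak(expected, expected + 1,
                                           std::memory_order_acq_rel)) {
        g_cas_misses.fetch_add(1, std::memory_order_relaxed);
      }
    }
  }
}

int
main()
{
  std::vector<std::thread> threads;
  for (int t = 0; t < kThreads; ++t) {
    threads.emplace_back(worker);
  }
  for (auto &th : threads) {
    th.join();
  }
  // Declaring g_head thread_local instead would drop the miss count to 0,
  // since each thread would only ever race with itself.
  std::printf("CAS misses: %ld\n", g_cas_misses.load());
}
```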
We tested marking this `defaultSizeArenaBlock` allocator `thread_local` and there was a significant performance increase.
I see a couple of discussion items around this:
- With freelists off, this allocator uses system malloc (or a thread-local jemalloc heap) for allocation, so in some situations marking this allocator `thread_local` ends up stacking one thread-local mechanism on top of another, which seems odd.
- Normally this might be a use case for a `ProxyAllocator`, but code organization-wise, `Arena` is in `tscore` while `ProxyAllocator` is in `iocore/eventsystem`, so the dependency would go the wrong direction.
The most direct solution is just to mark this `defaultSizeArenaBlock` allocator `thread_local` (sketched below), but I'd like to get some community feedback on this issue.
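For concreteness, the change being proposed is on the order of the following (a sketch; the exact declaration and constructor arguments in `src/tscore/Arena.cc` may differ):

```cpp
// src/tscore/Arena.cc (sketch)

// Today: one allocator instance shared by every thread, so all ArenaBlock
// freelist traffic funnels through a single CAS'd head pointer.
//   static Allocator defaultSizeArenaBlock("ArenaBlock", DEFAULT_BLOCK_SIZE);

// Proposed: one instance per thread, so Arena::alloc/Arena::reset never
// contend and the freelist CAS succeeds on the first try.
thread_local Allocator defaultSizeArenaBlock("ArenaBlock", DEFAULT_BLOCK_SIZE);
```

One tradeoff to weigh: each thread would keep its own freelist, so memory cached by one thread is not reusable by another.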
Thanks for filing this @cmcfarlen:
This really is worth the community's time to find a way forward, as initial testing of the prototype showed a staggering increase in transaction rate. I was only able to test one system type in the lab, but I do plan to do more.
- ~545,000 RPS (9.2.x as-is)
- ~720,000 RPS (thread_local prototype)
After working together on testing https://github.com/apache/trafficserver/pull/8805, it quickly became apparent that this was the next hot path.
If we're looking at different possible implementations for the arena, this is an option: https://github.com/wkaras/heap-memory-manager
This issue has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.