New circular buffer using FAA instead of CAS to improve performance
Solves #3645
Changes
I noticed that the original circular buffer used by the batch span processor is a lock-free MPSC queue based on CAS. Under intense multithreaded contention, Compare-And-Swap (CAS) scales worse than Fetch-And-Add (FAA). Following the approach in https://github.com/dbittman/waitfree-mpsc-queue, this PR attempts to implement a wait-free MPSC (Multiple Producer, Single Consumer) queue using FAA. The original benchmark indicates that this approach scales better:
Run on (48 X 2593.99 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x24)
L1 Instruction 32 KiB (x24)
L2 Unified 256 KiB (x24)
L3 Unified 30720 KiB (x2)
Load Average: 7.85, 5.70, 4.48
----------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------
BM_BaselineBuffer/1 10178537 ns 51528 ns 1000
BM_BaselineBuffer/2 7408646 ns 69828 ns 1000
BM_BaselineBuffer/4 7684772 ns 127549 ns 1000
BM_BaselineBuffer/8 7222459 ns 278660 ns 1000
BM_BaselineBuffer/16 6716972 ns 603712 ns 1215
BM_LockFreeBuffer/1 3915343 ns 53125 ns 1000
BM_LockFreeBuffer/2 4798406 ns 70581 ns 1000
BM_LockFreeBuffer/4 4562709 ns 128493 ns 1000
BM_LockFreeBuffer/8 4935221 ns 289996 ns 1000
BM_LockFreeBuffer/16 5187913 ns 618856 ns 1081
BM_OptimizedBuffer/1 4256507 ns 49970 ns 1000
BM_OptimizedBuffer/2 3398719 ns 67712 ns 1000
BM_OptimizedBuffer/4 3204749 ns 127378 ns 1000
BM_OptimizedBuffer/8 3230722 ns 296507 ns 1000
BM_OptimizedBuffer/16 3859005 ns 769220 ns 1000
I would be happy to refine this PR further if the approach is acceptable.
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 90.08%. Comparing base (aeb5a01) to head (1836bbf).
:warning: Report is 50 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #3644 +/- ##
==========================================
+ Coverage 90.03% 90.08% +0.06%
==========================================
Files 220 220
Lines 7065 7111 +46
==========================================
+ Hits 6360 6405 +45
- Misses 705 706 +1
| Files with missing lines | Coverage Δ |
|---|---|
| ...include/opentelemetry/sdk/common/circular_buffer.h | 98.15% <ø> (-1.85%) :arrow_down: |
| ...e/opentelemetry/sdk/common/circular_buffer_range.h | 100.00% <100.00%> (ø) |
On Mac M1 I get:
Run on (10 X 24 MHz CPU s)
CPU Caches:
L1 Data 64 KiB
L1 Instruction 128 KiB
L2 Unified 4096 KiB (x10)
Load Average: 9.93, 6.67, 5.87
----------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------
BM_BaselineBuffer/1 2789978 ns 32254 ns 1000
BM_BaselineBuffer/2 2949209 ns 46430 ns 1000
BM_BaselineBuffer/4 1956026 ns 72750 ns 9651
BM_BaselineBuffer/8 1690755 ns 117763 ns 5736
BM_BaselineBuffer/16 1491443 ns 212598 ns 3275
BM_LockFreeBuffer/1 1223514 ns 32825 ns 10000
BM_LockFreeBuffer/2 1369947 ns 45685 ns 10000
BM_LockFreeBuffer/4 2768243 ns 76035 ns 1000
BM_LockFreeBuffer/8 3852933 ns 135797 ns 1000
BM_LockFreeBuffer/16 4341987 ns 259862 ns 1000
BM_OptimizedBuffer/1 1471849 ns 32878 ns 10000
BM_OptimizedBuffer/2 1138345 ns 45200 ns 10000
BM_OptimizedBuffer/4 1536488 ns 73745 ns 9610
BM_OptimizedBuffer/8 1577325 ns 126947 ns 5565
BM_OptimizedBuffer/16 1833288 ns 237527 ns 2986
So the result is similar to SpinLockMutex: the higher the contention, the less efficient atomics are, and the better std::mutex performs.
And a compile fix is also required for macOS, where `size_t` is `unsigned long` (while `uint64_t` is `unsigned long long`), so the explicit template argument is needed:
uint64_t max_check = std::min<uint64_t>(current_count, capacity_);