
New circular buffer using FAA instead of CAS to optimize performance

Open ZENOTME opened this issue 3 months ago • 4 comments

Solves #3645

Changes

I noticed that the circular buffer currently used by the batch span processor is a lock-free MPSC queue based on CAS. Under intense multithreaded contention, Compare-And-Swap (CAS) scales worse than Fetch-And-Add (FAA), because every failed CAS forces a retry. Following the approach in https://github.com/dbittman/waitfree-mpsc-queue, this PR implements a wait-free MPSC (Multiple Producer, Single Consumer) queue using FAA. Running the existing benchmark indicates that this approach scales better.
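For context, the FAA idea from dbittman/waitfree-mpsc-queue can be sketched roughly as follows (all names here are illustrative, not the PR's actual code): each producer claims a unique slot with a single `fetch_add` instead of a CAS retry loop, then publishes it to the single consumer through a per-slot ready flag.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>
#include <utility>

// Rough sketch of an FAA-based wait-free MPSC queue; simplified for
// illustration (e.g. no handling of head_ counter wraparound).
template <typename T, std::size_t N>
class FaaMpscQueue
{
public:
  // Producers: one fetch_add reserves capacity, a second claims a unique
  // slot index. Neither can fail and retry, unlike a CAS loop.
  bool push(T value)
  {
    std::size_t count = count_.fetch_add(1, std::memory_order_acquire);
    if (count >= N)
    {
      // Queue full: undo the reservation and report failure.
      count_.fetch_sub(1, std::memory_order_release);
      return false;
    }
    std::size_t head = head_.fetch_add(1, std::memory_order_acquire) % N;
    slots_[head]     = std::move(value);
    ready_[head].store(true, std::memory_order_release);  // publish to consumer
    return true;
  }

  // Single consumer: pops in FIFO order, gated only by the ready flag.
  std::optional<T> pop()
  {
    if (!ready_[tail_].load(std::memory_order_acquire))
    {
      return std::nullopt;  // slot not yet published
    }
    T value = std::move(slots_[tail_]);
    ready_[tail_].store(false, std::memory_order_release);
    tail_ = (tail_ + 1) % N;
    count_.fetch_sub(1, std::memory_order_release);
    return value;
  }

private:
  std::array<T, N> slots_{};
  std::array<std::atomic<bool>, N> ready_{};
  std::atomic<std::size_t> count_{0};
  std::atomic<std::size_t> head_{0};
  std::size_t tail_{0};  // owned by the single consumer, no atomics needed
};
```

The key property is that a producer never loops: both `fetch_add` calls complete in one atomic RMW each, so progress is wait-free regardless of how many other producers are pushing concurrently.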

Run on (48 X 2593.99 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 256 KiB (x24)
  L3 Unified 30720 KiB (x2)
Load Average: 7.85, 5.70, 4.48
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_BaselineBuffer/1     10178537 ns        51528 ns         1000
BM_BaselineBuffer/2      7408646 ns        69828 ns         1000
BM_BaselineBuffer/4      7684772 ns       127549 ns         1000
BM_BaselineBuffer/8      7222459 ns       278660 ns         1000
BM_BaselineBuffer/16     6716972 ns       603712 ns         1215
BM_LockFreeBuffer/1      3915343 ns        53125 ns         1000
BM_LockFreeBuffer/2      4798406 ns        70581 ns         1000
BM_LockFreeBuffer/4      4562709 ns       128493 ns         1000
BM_LockFreeBuffer/8      4935221 ns       289996 ns         1000
BM_LockFreeBuffer/16     5187913 ns       618856 ns         1081
BM_OptimizedBuffer/1     4256507 ns        49970 ns         1000
BM_OptimizedBuffer/2     3398719 ns        67712 ns         1000
BM_OptimizedBuffer/4     3204749 ns       127378 ns         1000
BM_OptimizedBuffer/8     3230722 ns       296507 ns         1000
BM_OptimizedBuffer/16    3859005 ns       769220 ns         1000

I would like to refine this PR if it's acceptable.

ZENOTME avatar Sep 17 '25 13:09 ZENOTME

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 90.08%. Comparing base (aeb5a01) to head (1836bbf). :warning: Report is 50 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3644      +/-   ##
==========================================
+ Coverage   90.03%   90.08%   +0.06%     
==========================================
  Files         220      220              
  Lines        7065     7111      +46     
==========================================
+ Hits         6360     6405      +45     
- Misses        705      706       +1     
Files with missing lines Coverage Δ
...include/opentelemetry/sdk/common/circular_buffer.h 98.15% <ø> (-1.85%) :arrow_down:
...e/opentelemetry/sdk/common/circular_buffer_range.h 100.00% <100.00%> (ø)

... and 21 files with indirect coverage changes


codecov[bot] avatar Sep 17 '25 13:09 codecov[bot]

On Mac M1 I get:

Run on (10 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
Load Average: 9.93, 6.67, 5.87
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_BaselineBuffer/1      2789978 ns        32254 ns         1000
BM_BaselineBuffer/2      2949209 ns        46430 ns         1000
BM_BaselineBuffer/4      1956026 ns        72750 ns         9651
BM_BaselineBuffer/8      1690755 ns       117763 ns         5736
BM_BaselineBuffer/16     1491443 ns       212598 ns         3275
BM_LockFreeBuffer/1      1223514 ns        32825 ns        10000
BM_LockFreeBuffer/2      1369947 ns        45685 ns        10000
BM_LockFreeBuffer/4      2768243 ns        76035 ns         1000
BM_LockFreeBuffer/8      3852933 ns       135797 ns         1000
BM_LockFreeBuffer/16     4341987 ns       259862 ns         1000
BM_OptimizedBuffer/1     1471849 ns        32878 ns        10000
BM_OptimizedBuffer/2     1138345 ns        45200 ns        10000
BM_OptimizedBuffer/4     1536488 ns        73745 ns         9610
BM_OptimizedBuffer/8     1577325 ns       126947 ns         5565
BM_OptimizedBuffer/16    1833288 ns       237527 ns         2986

So the result is similar to SpinLockMutex: the higher the contention, the less efficient the atomics become and the better std::mutex performs.
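The contention effect described above comes down to the retry loop: a CAS-based update must spin until it wins, while an FAA update always completes in one instruction. A minimal illustration (sketch only, not the SDK's actual SpinLockMutex or buffer code):

```cpp
#include <atomic>
#include <cstdint>

// CAS: each failed compare_exchange forces a reload and retry, so the work
// wasted per successful update grows with the number of contending threads.
void cas_increment(std::atomic<uint64_t> &counter)
{
  uint64_t expected = counter.load(std::memory_order_relaxed);
  // On failure, compare_exchange_weak refreshes `expected` with the current
  // value; under heavy contention this loop may spin many times.
  while (!counter.compare_exchange_weak(expected, expected + 1,
                                        std::memory_order_acq_rel))
  {
  }
}

// FAA: always completes in a single atomic RMW, regardless of contention.
void faa_increment(std::atomic<uint64_t> &counter)
{
  counter.fetch_add(1, std::memory_order_acq_rel);
}
```

On many platforms a blocked std::mutex eventually parks the thread in the kernel instead of burning CPU in a retry loop, which is one plausible reason it can overtake spinning atomics at high contention.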

Reneg973 avatar Oct 26 '25 03:10 Reneg973

And a compile fix for libc++ (macOS) is also required, because size_t is unsigned long there, so the std::min template argument must be spelled out:

   uint64_t max_check = std::min<uint64_t>(current_count, capacity_);
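To illustrate why the explicit template argument is needed (function name is hypothetical): on macOS, `size_t` is `unsigned long` while `uint64_t` is `unsigned long long`, so `std::min(a, b)` with one argument of each type cannot deduce a single template parameter and fails to compile. Forcing `std::min<uint64_t>` converts both arguments to the same type.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper showing the mixed-type std::min issue.
uint64_t clamp_count(uint64_t current_count, std::size_t capacity)
{
  // std::min(current_count, capacity);  // fails to deduce T on macOS,
  //                                     // where size_t != uint64_t
  return std::min<uint64_t>(current_count, capacity);  // explicit T: OK everywhere
}
```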

Reneg973 avatar Oct 26 '25 09:10 Reneg973