
Perf regression from 4.1.2 to 4.4.2?

Open madscientist opened this issue 2 months ago • 10 comments

I found an issue with the iterator in 4.1.2 that was fixed in 4.4.2, so I was going to upgrade. But, I noticed in our microbenchmarks that 4.4.2 was slower than 4.1.2. I checked with the CRoaring microbenchmarks and can see a similar slowdown, almost across the board.

Is this a known regression that we are accepting for other reasons, or is it not expected? FYI, I'm compiling with GCC 14.2.0 on Intel(R) Xeon Gold 6248R CPU @ 3.00GHz, default flags from the CRoaring CMake environment.

Here is a comparison using the default microbenchmark. I used some Emacs rectangle kill/yank foo to add the comparison (I know there's some fancy tooling to compare Google Bench output, but I forget how to use it :)). You can see that many of the values are worse for 4.4.2: fewer iterations / longer CPU time:

2025-10-08T11:49:06-04:00
Running ./microbenchmarks/bench
Run on (12 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x12)
  L1 Instruction 32 KiB (x12)
  L2 Unified 1024 KiB (x12)
  L3 Unified 36608 KiB (x12)
Load Average: 0.24, 0.37, 0.26
AVX-2 hardware: yes
AVX-512: supported by compiler
AVX-512 hardware: no
In RAM volume in MiB (estimated): 1.802830
benchmarking other files: You may pass is a data directory as a parameter.
data source: /data/psh13/src/CRoaring/benchmarks/realdata/census1881
number of bitmaps: 200
performance counters: No privileged access (sudo may help).
x64: detected
                                                 4.1.2                    4.4.2
----------------------------------------------------------------------------------------
Benchmark                                    Time   Iterations         Time   Iterations
----------------------------------------------------------------------------------------
SuccessiveIntersection                   28848 ns        24252     29228 ns        23898
SuccessiveIntersection64                 46578 ns        15024     49466 ns        14136
SuccessiveIntersectionCardinality        26155 ns        26779     27928 ns        25011
SuccessiveIntersectionCardinality64      42673 ns        16406     43998 ns        15883
SuccessiveUnionCardinality               36853 ns        18958     36438 ns        19159
SuccessiveUnionCardinality64             97501 ns         7187    100082 ns         6984
SuccessiveDifferenceCardinality          32514 ns        21540     33503 ns        20863
SuccessiveDifferenceCardinality64        69695 ns        10041     72552 ns         9645
SuccessiveUnion                         573901 ns         1227    558773 ns         1255
SuccessiveUnion64                       935391 ns          745    931132 ns          751
TotalUnion                              637562 ns         1097    633914 ns         1103
TotalUnionHeap                         2014668 ns          347   1952174 ns          358
RandomAccess                              3531 ns       196919      3514 ns       199751
RandomAccess64                            5811 ns       121117      6394 ns       109562
RandomAccess64Cpp                         4500 ns       132872      4554 ns       153998
ToArray                                 148464 ns         4702    148486 ns         4697
ToArray64                               604936 ns         1156    615433 ns         1134
IterateAll                             3425703 ns          204   3431836 ns          204
IterateAll64                           5007200 ns          142   5078060 ns          139
ComputeCardinality                        3928 ns       179276      3638 ns       191939
ComputeCardinality64                     22224 ns        31109     24014 ns        29137
RankManySlow                             16732 ns        41747     15564 ns        44669
RankMany                                  6382 ns       109172      6171 ns       111491

A few things (Union tests for example) are faster but a number are slower, particularly 64bit (which is, unfortunately, mostly what I use).

If this is interesting / new info I can do some bisecting across versions to get more details. If this is known / understood then I won't bother.

madscientist avatar Oct 08 '25 16:10 madscientist

This is likely from PR #688, as it significantly restructured the 64-bit internals. In case you're interested, that PR also added some new synthetic benchmarks which focus more on higher-order bits, see microbenchmarks/synthetic_bench.cpp and the commit message of https://github.com/RoaringBitmap/CRoaring/commit/f30bfa90848edcd446dc674ae8f71a040f72067a for a sample benchmark run.

SLieve avatar Oct 08 '25 19:10 SLieve

Please run the following Python script: https://gist.github.com/lemire/fabe642b256987f94695c03b3d7c2515

lemire avatar Oct 08 '25 19:10 lemire

Running my own script, I do spot a significant regression on the IterateAll64 benchmark, which is not apparent in your results.

[chart: IterateAll64 timings across versions]

And a small degradation in RandomAccess64.

[chart: RandomAccess64 timings across versions]

I do not see anything else that has degraded.

> I checked with the CRoaring microbenchmarks and can see a similar slowdown, almost across the board.

Please be careful: a simple CPU frequency change could explain the difference, and caching effects and other factors also impact performance.

Please report the results with performance counters...

You can get the desired results under macOS or Linux, often by running the benchmark under sudo (privileged mode).

Here are my results....

4.1.2:

----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
SuccessiveIntersection                   19879 ns        19808 ns        36340 GHz=4.73744 cycles=88.628k instructions=412.548k
SuccessiveIntersection64                 21900 ns        21815 ns        31985 GHz=4.68961 cycles=99.068k instructions=450.278k
SuccessiveIntersectionCardinality        13237 ns        13187 ns        53671 GHz=4.79969 cycles=61.196k instructions=225.372k
SuccessiveIntersectionCardinality64      16146 ns        16080 ns        43595 GHz=4.74067 cycles=75.059k instructions=295.335k
SuccessiveUnionCardinality               16252 ns        16199 ns        43612 GHz=4.77533 cycles=73.817k instructions=281.402k
SuccessiveUnionCardinality64             33258 ns        33143 ns        21267 GHz=4.62926 cycles=146.59k instructions=688.594k
SuccessiveDifferenceCardinality          14933 ns        14879 ns        47248 GHz=4.79019 cycles=68.059k instructions=254.548k
SuccessiveDifferenceCardinality64        24873 ns        24766 ns        28327 GHz=4.67021 cycles=111.109k instructions=493.63k
SuccessiveUnion                         226659 ns       225770 ns         3077 GHz=4.53387 cycles=926.042k instructions=5.60829M
SuccessiveUnion64                       338843 ns       337749 ns         2077 GHz=4.52556 cycles=1.38765M instructions=8.16634M
TotalUnion                              565838 ns       563770 ns         1244 GHz=4.52126 cycles=2.41774M instructions=9.26225M
TotalUnionHeap                         1081140 ns      1077017 ns          637 GHz=4.51637 cycles=4.52596M instructions=16.2751M
RandomAccess                              2120 ns         2112 ns       325915 GHz=6.00471 cycles=12.76k instructions=48.884k
RandomAccess64                            3194 ns         3182 ns       214578 GHz=6.05899 cycles=14.893k instructions=65.999k
RandomAccess64Cpp                         1912 ns         1905 ns       369744 GHz=6.32775 cycles=11.333k instructions=54.684k
ToArray                                 285813 ns       284804 ns         2485 GHz=4.52746 cycles=1.1943M instructions=7.90309M
ToArray64                               139914 ns       139335 ns         5002 GHz=4.54549 cycles=600.005k instructions=2.17751M
IterateAll                             2037133 ns      1969447 ns          365 GHz=4.38105 cycles=7.92915M instructions=53.4871M
IterateAll64                           2472474 ns      2463372 ns          285 GHz=4.51945 cycles=10.4848M instructions=55.7099M
ComputeCardinality                        1654 ns         1648 ns       438434 GHz=6.65267 cycles=9.979k instructions=45.089k
ComputeCardinality64                      8809 ns         8772 ns        77972 GHz=4.97056 cycles=37.488k instructions=213.628k
RankManySlow                              5828 ns         5804 ns       120097 GHz=5.11476 cycles=27.277k instructions=147.888k
RankMany                                  3370 ns         3357 ns       205871 GHz=5.55163 cycles=17.582k instructions=78.854k

current commit:

----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
SuccessiveIntersection                   19579 ns        19507 ns        36376 GHz=4.71563 cycles=89.399k instructions=413.7k
SuccessiveIntersection64                 23842 ns        23761 ns        29373 GHz=4.69042 cycles=108.269k instructions=453.339k
SuccessiveIntersectionCardinality        13404 ns        13311 ns        52163 GHz=4.79626 cycles=62.15k instructions=225.352k
SuccessiveIntersectionCardinality64      17194 ns        17139 ns        41686 GHz=4.74824 cycles=77.553k instructions=304.35k
SuccessiveUnionCardinality               16610 ns        16550 ns        43041 GHz=4.74207 cycles=75.086k instructions=280.593k
SuccessiveUnionCardinality64             33043 ns        32933 ns        21258 GHz=4.64025 cycles=147.328k instructions=699.416k
SuccessiveDifferenceCardinality          15257 ns        15205 ns        46976 GHz=4.77364 cycles=69.614k instructions=254.16k
SuccessiveDifferenceCardinality64        25111 ns        25015 ns        27998 GHz=4.67393 cycles=113.539k instructions=503.56k
SuccessiveUnion                         238172 ns       228244 ns         3176 GHz=4.53286 cycles=944.159k instructions=5.61135M
SuccessiveUnion64                       351564 ns       350277 ns         2018 GHz=4.52563 cycles=1.44537M instructions=8.23771M
TotalUnion                              562060 ns       560109 ns         1251 GHz=4.521 cycles=2.40461M instructions=9.26517M
TotalUnionHeap                         1036921 ns      1033062 ns          675 GHz=4.51789 cycles=4.33566M instructions=16.0435M
RandomAccess                              2130 ns         2123 ns       328224 GHz=6.03576 cycles=12.826k instructions=48.88k
RandomAccess64                            2992 ns         2982 ns       239178 GHz=5.9226 cycles=14.309k instructions=71.436k
RandomAccess64Cpp                         1892 ns         1885 ns       371878 GHz=6.37207 cycles=11.68k instructions=54.68k
ToArray                                 282170 ns       281108 ns         2483 GHz=4.52505 cycles=1.18613M instructions=7.90292M
ToArray64                               139804 ns       139304 ns         4983 GHz=4.54157 cycles=608.38k instructions=2.18618M
IterateAll                             1961577 ns      1953646 ns          367 GHz=4.49 cycles=7.95965M instructions=53.4959M
IterateAll64                           2701690 ns      2692246 ns          260 GHz=4.50989 cycles=11.4825M instructions=60.731M
ComputeCardinality                        1619 ns         1613 ns       432168 GHz=6.58916 cycles=9.607k instructions=44.677k
ComputeCardinality64                      8400 ns         8367 ns        82885 GHz=4.95302 cycles=38.589k instructions=214.738k
RankManySlow                              5714 ns         5695 ns       123625 GHz=5.10898 cycles=26.393k instructions=146.956k
RankMany                                  2916 ns         2906 ns       238752 GHz=5.71345 cycles=15.712k instructions=78.507k

(The frequency is unreliable, but the instruction counts should be fine.)

You can see that both RandomAccess64 and IterateAll64 use many more instructions.

You can see run-to-run differences, but if the number of instructions is stable, these small differences are likely not significant.

lemire avatar Oct 08 '25 19:10 lemire

I remembered how to use Google Benchmark's compare.py and did a comparison of 20 runs, which confirmed your findings above about where the issue occurred. I'm seeing (in my benchmarks) about a 15% slowdown in RandomAccess and Iterate between versions 4.2.3 and 4.3.0. Unfortunately for me, that covers most of my usage: I use Roaring primarily as a sparse bitmap to track operations across data elements, I rarely need to union or difference bitmaps, and I don't even unset bits all that often.

I did try with sudo, but I realized I was running my tests on a VMware instance (though its performance is quite stable), and it can't access the raw CPU counters. I'll redo the tests and check out your script on my local Linux system later today or tomorrow.

madscientist avatar Oct 08 '25 21:10 madscientist

To be clear, my belief is that this affects IterateAll64 and RandomAccess64. I think that IterateAll and RandomAccess are unaffected.

I trust that @SLieve is likely right that it is due to https://github.com/RoaringBitmap/CRoaring/pull/688 although I have not investigated the issue further.

lemire avatar Oct 08 '25 23:10 lemire

Thanks @lemire I agree with your findings. My original tests that seemed to show a wider slowdown weren't representative and those results couldn't be reproduced when I did more testing. But agreed that the iterate / random access tests for 64bit are definitely slower.

Is this still an issue that will be investigated, perhaps by @SLieve, to try to improve? Or is it considered an acceptable trade-off for the advantages that #688 provides, such as speeding up finding the next free index?

Thanks!

madscientist avatar Oct 09 '25 14:10 madscientist

@madscientist We should understand why there is a performance degradation and what is actionable.

IterateAll64:

  41.74%  bench    libroaring.so.4.0.0    [.] roaring64_iterator_advance                                                                                                     ◆
  21.24%  bench    libroaring.so.4.0.0    [.] container_iterator_next                                                                                                        ▒
  15.52%  bench    bench                  [.] void BasicBench<iterate_all64>(benchmark::State&)                                                                              ▒
   5.20%  bench    bench                  [.] roaring64_iterator_advance@plt                                                                                                 ▒
   5.14%  bench    libroaring.so.4.0.0    [.] roaring64_iterator_has_value                                                                                                   ▒
   5.14%  bench    libroaring.so.4.0.0    [.] container_iterator_next@plt

RandomAccess64:

  28.90%  bench    libroaring.so.4.0.0   [.] roaring_bitmap_contains
  24.65%  bench    libroaring.so.4.0.0   [.] art_find
  19.24%  bench    libroaring.so.4.0.0   [.] roaring64_bitmap_contains
  14.68%  bench    bench                 [.] void BasicBench<random_access64_cpp>(benchmark::State&)

I would expect more inlining. I will open an issue.

lemire avatar Oct 09 '25 15:10 lemire

So I have opened an issue...

https://github.com/RoaringBitmap/CRoaring/issues/755

There are three functions that I believe should be candidates for inlining. They are not currently inlined. Inlining them might help with this issue.

lemire avatar Oct 09 '25 15:10 lemire

If you look at a typical execution of roaring64_iterator_advance, it almost always calls container_iterator_next directly. So we have a cheap function calling another cheap function, which makes it a good candidate for optimization by inlining.

In principle, link-time optimization could help, but it does not seem to happen (and that's expected).

lemire avatar Oct 09 '25 15:10 lemire

PR https://github.com/RoaringBitmap/CRoaring/pull/766 should help.

Basically, I am moving to make more code inlineable.

lemire avatar Nov 26 '25 23:11 lemire