Perf regression from 4.1.2 to 4.4.2?
I found an issue with the iterator in 4.1.2 that was fixed in 4.4.2, so I was going to upgrade. But I noticed in our microbenchmarks that 4.4.2 was slower than 4.1.2. I checked with the CRoaring microbenchmarks and can see a similar slowdown, almost across the board.
Is this a known regression that we are accepting for other reasons, or is it not expected? FYI, I'm compiling with GCC 14.2.0 on Intel(R) Xeon Gold 6248R CPU @ 3.00GHz, default flags from the CRoaring CMake environment.
Here is a comparison using the default microbenchmark. I used some Emacs rectangle kill/yank-fu to line the two result sets up side by side (I know there's some fancy tooling to compare Google Benchmark output, but I forget how to use it :) ). You can see that many of the values are worse for 4.4.2: fewer iterations and longer CPU time:
2025-10-08T11:49:06-04:00
Running ./microbenchmarks/bench
Run on (12 X 3000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x12)
L1 Instruction 32 KiB (x12)
L2 Unified 1024 KiB (x12)
L3 Unified 36608 KiB (x12)
Load Average: 0.24, 0.37, 0.26
AVX-2 hardware: yes
AVX-512: supported by compiler
AVX-512 hardware: no
In RAM volume in MiB (estimated): 1.802830
benchmarking other files: You may pass is a data directory as a parameter.
data source: /data/psh13/src/CRoaring/benchmarks/realdata/census1881
number of bitmaps: 200
performance counters: No privileged access (sudo may help).
x64: detected
----------------------------------------------------------------------------------------
Benchmark                                4.1.2 Time    4.1.2 Iterations    4.4.2 Time    4.4.2 Iterations
----------------------------------------------------------------------------------------
SuccessiveIntersection 28848 ns 24252 29228 ns 23898
SuccessiveIntersection64 46578 ns 15024 49466 ns 14136
SuccessiveIntersectionCardinality 26155 ns 26779 27928 ns 25011
SuccessiveIntersectionCardinality64 42673 ns 16406 43998 ns 15883
SuccessiveUnionCardinality 36853 ns 18958 36438 ns 19159
SuccessiveUnionCardinality64 97501 ns 7187 100082 ns 6984
SuccessiveDifferenceCardinality 32514 ns 21540 33503 ns 20863
SuccessiveDifferenceCardinality64 69695 ns 10041 72552 ns 9645
SuccessiveUnion 573901 ns 1227 558773 ns 1255
SuccessiveUnion64 935391 ns 745 931132 ns 751
TotalUnion 637562 ns 1097 633914 ns 1103
TotalUnionHeap 2014668 ns 347 1952174 ns 358
RandomAccess 3531 ns 196919 3514 ns 199751
RandomAccess64 5811 ns 121117 6394 ns 109562
RandomAccess64Cpp 4500 ns 132872 4554 ns 153998
ToArray 148464 ns 4702 148486 ns 4697
ToArray64 604936 ns 1156 615433 ns 1134
IterateAll 3425703 ns 204 3431836 ns 204
IterateAll64 5007200 ns 142 5078060 ns 139
ComputeCardinality 3928 ns 179276 3638 ns 191939
ComputeCardinality64 22224 ns 31109 24014 ns 29137
RankManySlow 16732 ns 41747 15564 ns 44669
RankMany 6382 ns 109172 6171 ns 111491
A few things (the Union tests, for example) are faster, but a number are slower, particularly the 64-bit variants (which are, unfortunately, mostly what I use).
If this is interesting / new info I can do some bisecting across versions to get more details. If this is known / understood then I won't bother.
This is likely from PR #688, as it significantly restructured the 64-bit internals. In case you're interested, that PR also added some new synthetic benchmarks which focus more on higher-order bits; see microbenchmarks/synthetic_bench.cpp and the commit message of https://github.com/RoaringBitmap/CRoaring/commit/f30bfa90848edcd446dc674ae8f71a040f72067a for a sample benchmark run.
Please run the following Python script: https://gist.github.com/lemire/fabe642b256987f94695c03b3d7c2515
Running my own script, I do spot a significant regression on the IterateAll64 benchmark, which does not show up in your results above.
And a small degradation in RandomAccess64.
I do not see anything else that has degraded.
> I checked with the CRoaring microbenchmarks and can see a similar slowdown, almost across the board.
Please be careful. A simple CPU frequency change could explain the difference. Caching effects and other factors also impact performance.
Please report the results with performance counters...
You can get the desired results under macOS or Linux, often by running the benchmark under sudo (privileged mode).
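For example, from the CMake build directory used above:

sudo ./microbenchmarks/bench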
Here are my results....
4.1.2:
----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------
SuccessiveIntersection 19879 ns 19808 ns 36340 GHz=4.73744 cycles=88.628k instructions=412.548k
SuccessiveIntersection64 21900 ns 21815 ns 31985 GHz=4.68961 cycles=99.068k instructions=450.278k
SuccessiveIntersectionCardinality 13237 ns 13187 ns 53671 GHz=4.79969 cycles=61.196k instructions=225.372k
SuccessiveIntersectionCardinality64 16146 ns 16080 ns 43595 GHz=4.74067 cycles=75.059k instructions=295.335k
SuccessiveUnionCardinality 16252 ns 16199 ns 43612 GHz=4.77533 cycles=73.817k instructions=281.402k
SuccessiveUnionCardinality64 33258 ns 33143 ns 21267 GHz=4.62926 cycles=146.59k instructions=688.594k
SuccessiveDifferenceCardinality 14933 ns 14879 ns 47248 GHz=4.79019 cycles=68.059k instructions=254.548k
SuccessiveDifferenceCardinality64 24873 ns 24766 ns 28327 GHz=4.67021 cycles=111.109k instructions=493.63k
SuccessiveUnion 226659 ns 225770 ns 3077 GHz=4.53387 cycles=926.042k instructions=5.60829M
SuccessiveUnion64 338843 ns 337749 ns 2077 GHz=4.52556 cycles=1.38765M instructions=8.16634M
TotalUnion 565838 ns 563770 ns 1244 GHz=4.52126 cycles=2.41774M instructions=9.26225M
TotalUnionHeap 1081140 ns 1077017 ns 637 GHz=4.51637 cycles=4.52596M instructions=16.2751M
RandomAccess 2120 ns 2112 ns 325915 GHz=6.00471 cycles=12.76k instructions=48.884k
RandomAccess64 3194 ns 3182 ns 214578 GHz=6.05899 cycles=14.893k instructions=65.999k
RandomAccess64Cpp 1912 ns 1905 ns 369744 GHz=6.32775 cycles=11.333k instructions=54.684k
ToArray 285813 ns 284804 ns 2485 GHz=4.52746 cycles=1.1943M instructions=7.90309M
ToArray64 139914 ns 139335 ns 5002 GHz=4.54549 cycles=600.005k instructions=2.17751M
IterateAll 2037133 ns 1969447 ns 365 GHz=4.38105 cycles=7.92915M instructions=53.4871M
IterateAll64 2472474 ns 2463372 ns 285 GHz=4.51945 cycles=10.4848M instructions=55.7099M
ComputeCardinality 1654 ns 1648 ns 438434 GHz=6.65267 cycles=9.979k instructions=45.089k
ComputeCardinality64 8809 ns 8772 ns 77972 GHz=4.97056 cycles=37.488k instructions=213.628k
RankManySlow 5828 ns 5804 ns 120097 GHz=5.11476 cycles=27.277k instructions=147.888k
RankMany 3370 ns 3357 ns 205871 GHz=5.55163 cycles=17.582k instructions=78.854k
current commit:
----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------
SuccessiveIntersection 19579 ns 19507 ns 36376 GHz=4.71563 cycles=89.399k instructions=413.7k
SuccessiveIntersection64 23842 ns 23761 ns 29373 GHz=4.69042 cycles=108.269k instructions=453.339k
SuccessiveIntersectionCardinality 13404 ns 13311 ns 52163 GHz=4.79626 cycles=62.15k instructions=225.352k
SuccessiveIntersectionCardinality64 17194 ns 17139 ns 41686 GHz=4.74824 cycles=77.553k instructions=304.35k
SuccessiveUnionCardinality 16610 ns 16550 ns 43041 GHz=4.74207 cycles=75.086k instructions=280.593k
SuccessiveUnionCardinality64 33043 ns 32933 ns 21258 GHz=4.64025 cycles=147.328k instructions=699.416k
SuccessiveDifferenceCardinality 15257 ns 15205 ns 46976 GHz=4.77364 cycles=69.614k instructions=254.16k
SuccessiveDifferenceCardinality64 25111 ns 25015 ns 27998 GHz=4.67393 cycles=113.539k instructions=503.56k
SuccessiveUnion 238172 ns 228244 ns 3176 GHz=4.53286 cycles=944.159k instructions=5.61135M
SuccessiveUnion64 351564 ns 350277 ns 2018 GHz=4.52563 cycles=1.44537M instructions=8.23771M
TotalUnion 562060 ns 560109 ns 1251 GHz=4.521 cycles=2.40461M instructions=9.26517M
TotalUnionHeap 1036921 ns 1033062 ns 675 GHz=4.51789 cycles=4.33566M instructions=16.0435M
RandomAccess 2130 ns 2123 ns 328224 GHz=6.03576 cycles=12.826k instructions=48.88k
RandomAccess64 2992 ns 2982 ns 239178 GHz=5.9226 cycles=14.309k instructions=71.436k
RandomAccess64Cpp 1892 ns 1885 ns 371878 GHz=6.37207 cycles=11.68k instructions=54.68k
ToArray 282170 ns 281108 ns 2483 GHz=4.52505 cycles=1.18613M instructions=7.90292M
ToArray64 139804 ns 139304 ns 4983 GHz=4.54157 cycles=608.38k instructions=2.18618M
IterateAll 1961577 ns 1953646 ns 367 GHz=4.49 cycles=7.95965M instructions=53.4959M
IterateAll64 2701690 ns 2692246 ns 260 GHz=4.50989 cycles=11.4825M instructions=60.731M
ComputeCardinality 1619 ns 1613 ns 432168 GHz=6.58916 cycles=9.607k instructions=44.677k
ComputeCardinality64 8400 ns 8367 ns 82885 GHz=4.95302 cycles=38.589k instructions=214.738k
RankManySlow 5714 ns 5695 ns 123625 GHz=5.10898 cycles=26.393k instructions=146.956k
RankMany 2916 ns 2906 ns 238752 GHz=5.71345 cycles=15.712k instructions=78.507k
(The frequency is unreliable, but the instruction counts should be fine.)
You can see that both RandomAccess64 and IterateAll64 use many more instructions.
You can see run-to-run differences, but if the number of instructions is stable, these small differences are likely not significant.
I remembered how to use Google Benchmark's compare.py and did a comparison over 20 runs, which confirms your findings above about where the issue occurred. I'm seeing (in my benchmarks) about a 15% slowdown in RandomAccess and Iterate between versions 4.2.3 and 4.3.0... which unfortunately constitutes most of my usage, since I use Roaring primarily as a sparse bitmap to track operations across data elements (see the sketch at the end of this comment): I rarely need to do things like union or difference bitmaps, and I don't even unset bits all that often.
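For reference (since I had forgotten how), the workflow was roughly the following; the JSON file names are just placeholders, and compare.py lives in the google/benchmark tools directory:

# Run the bench built against each CRoaring version, recording JSON with repetitions.
./microbenchmarks/bench --benchmark_repetitions=20 --benchmark_out=roaring-4.2.3.json --benchmark_out_format=json
./microbenchmarks/bench --benchmark_repetitions=20 --benchmark_out=roaring-4.3.0.json --benchmark_out_format=json
# Compare the two result sets (reports per-benchmark time deltas).
python3 tools/compare.py benchmarks roaring-4.2.3.json roaring-4.3.0.json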
I did try with sudo, but I realized I was running my tests on a VMware instance (the performance there is quite stable, but it can't access the raw CPU counters). I'll re-do the tests and check out your script on my local Linux system later today or tomorrow.
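For context, here is a minimal sketch of the access pattern I rely on (illustrative only, not my real code; the values and bounds are made up), which corresponds roughly to the RandomAccess64 and IterateAll64 benchmarks:

#include <inttypes.h>
#include <stdio.h>
#include <roaring/roaring64.h>

int main(void) {
    roaring64_bitmap_t *seen = roaring64_bitmap_create();

    /* Track sparse 64-bit identifiers as operations happen (bits are set, rarely cleared). */
    for (uint64_t id = 0; id < 1000000; id += 97) {
        roaring64_bitmap_add(seen, id);
    }

    /* Point membership queries: the RandomAccess64-style part of the workload. */
    uint64_t hits = 0;
    for (uint64_t id = 0; id < 1000000; id += 31) {
        if (roaring64_bitmap_contains(seen, id)) {
            hits++;
        }
    }

    /* Occasional full pass over all set bits: the IterateAll64-style part. */
    uint64_t sum = 0;
    roaring64_iterator_t *it = roaring64_iterator_create(seen);
    while (roaring64_iterator_has_value(it)) {
        sum += roaring64_iterator_value(it);
        roaring64_iterator_advance(it);
    }
    roaring64_iterator_free(it);

    printf("hits=%" PRIu64 " sum=%" PRIu64 "\n", hits, sum);
    roaring64_bitmap_free(seen);
    return 0;
}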
To be clear, my belief is that this affects IterateAll64 and RandomAccess64. I think that IterateAll and RandomAccess are unaffected.
I trust that @SLieve is likely right that it is due to https://github.com/RoaringBitmap/CRoaring/pull/688 although I have not investigated the issue further.
Thanks @lemire, I agree with your findings. My original tests that seemed to show a wider slowdown weren't representative, and those results couldn't be reproduced when I did more testing. But agreed that the iterate / random access tests for 64-bit are definitely slower.
Is this still an issue that is going to be investigated, perhaps by @SLieve, to try to improve? Or is it considered an acceptable trade-off for the advantages that #688 gives, such as speeding up finding the next free index?
Thanks!
@madscientist We should understand why there is a performance degradation and what is actionable.
IterateAll64:
41.74% bench libroaring.so.4.0.0 [.] roaring64_iterator_advance
21.24% bench libroaring.so.4.0.0 [.] container_iterator_next
15.52% bench bench [.] void BasicBench<iterate_all64>(benchmark::State&)
 5.20% bench bench [.] roaring64_iterator_advance@plt
 5.14% bench libroaring.so.4.0.0 [.] roaring64_iterator_has_value
 5.14% bench libroaring.so.4.0.0 [.] container_iterator_next@plt
RandomAccess64:
28.90% bench libroaring.so.4.0.0 [.] roaring_bitmap_contains
24.65% bench libroaring.so.4.0.0 [.] art_find
19.24% bench libroaring.so.4.0.0 [.] roaring64_bitmap_contains
14.68% bench bench [.] void BasicBench<random_access64_cpp>(benchmark::State&)
I would expect more inlining. I will open an issue.
So I have opened an issue...
https://github.com/RoaringBitmap/CRoaring/issues/755
There are three functions that I believe should be candidates for inlining. They are not currently inlined. This might help with this issue.
If you look at a typical execution of roaring64_iterator_advance, it almost always calls container_iterator_next directly. So we have a cheap function calling another cheap function, which makes it a good candidate for optimization by inlining.
In principle, link-time optimization could help, but it does not seem to happen here (and that's expected: the benchmark calls into libroaring as a shared library, as the @plt entries in the profile show).
PR https://github.com/RoaringBitmap/CRoaring/pull/766 should help.
Basically, I am moving to make more code inlineable.
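For illustration, the general shape of the change (a sketch, not the actual CRoaring code) is to give small hot helpers a definition that is visible to callers, e.g. as static inline in a header, so the compiler can inline them at the call site without relying on LTO:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical iterator type, just for illustration. */
typedef struct {
    size_t index;
    size_t size;
} toy_iter_t;

/* Before: callers only see a declaration such as
 *     bool toy_iter_next(toy_iter_t *it);
 * and the definition sits in another translation unit, so every step of a
 * hot loop pays for a real call (through the PLT when the library is shared). */

/* After: the definition lives in the header as static inline, so each
 * caller's translation unit can inline it into its loop. */
static inline bool toy_iter_next(toy_iter_t *it) {
    it->index++;
    return it->index < it->size;
}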