[BUG] Rate counters are per-thread in multi-threaded benchmarks, kAvgThreadsRate does not make sense
**Describe the bug**

When using counters that represent a rate (`benchmark::Counter::kIsRate`), the rate appears to be computed per thread. Consequently, when using `kAvgThreadsRate`, the per-thread rate is then divided by the number of threads (again). This is misleading.
**System**

OS:

```
ProductName:    macOS
ProductVersion: 15.6.1
BuildVersion:   24G90
```

Compiler:

```
Apple clang version 17.0.0 (clang-1700.0.13.5)
Target: arm64-apple-darwin24.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
```
**Library**

```cmake
FetchContent_Declare(
  benchmark
  GIT_REPOSITORY https://github.com/google/benchmark.git
  GIT_TAG        v1.9.4
)
```
**To reproduce**

Write a benchmark that simply sleeps for 1 second and then increments a counter by 1. With 1 thread we expect a counter of 1 and a rate of 1/s. With 10 threads we expect a counter of 10 and a rate of 10/s. However, the rate is reported as 1/s and the per-thread rate as 0.1/s, which seems wrong.
```c++
#include <benchmark/benchmark.h>

#include <chrono>
#include <thread>

static void BM_IntegerAddition(benchmark::State& state) {
  for (auto _ : state) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    benchmark::DoNotOptimize(1 + 2);
  }
  state.counters["counter"] = benchmark::Counter(1);
  state.counters["counter_rate"] = benchmark::Counter(1, benchmark::Counter::kIsRate);
  state.counters["counter_thread_rate"] = benchmark::Counter(1, benchmark::Counter::kAvgThreadsRate);
}
BENCHMARK(BM_IntegerAddition)
    ->Threads(1)
    ->Threads(10)
    ->Iterations(1)
    ->UseRealTime();
```
```
---------------------------------------------------------------------------------------------------------------
Benchmark                                                    Time        CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------
BM_IntegerAddition/iterations:1/real_time/threads:1   1005048541 ns   29000 ns          1 counter=1 counter_rate=0.994977/s counter_thread_rate=0.994977/s
BM_IntegerAddition/iterations:1/real_time/threads:10  1002581584 ns   36300 ns         10 counter=10 counter_rate=0.997425/s counter_thread_rate=0.0997425/s
```
**Expected behavior**

```
---------------------------------------------------------------------------------------------------------------
Benchmark                                                    Time        CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------
BM_IntegerAddition/iterations:1/real_time/threads:1   1005048541 ns   29000 ns          1 counter=1 counter_rate=0.994977/s counter_thread_rate=0.994977/s
BM_IntegerAddition/iterations:1/real_time/threads:10  1002581584 ns   36300 ns         10 counter=10 counter_rate=9.97425/s counter_thread_rate=0.997425/s
```
The issue does not seem to mention it: which version of the library is this?
Sorry @LebedevRI! I updated the description: 1.9.4.
Ah yes, i've seen this before. That whole `->Threads()` thing is weird. It's essentially for modelling separate workers.
For the purpose of counters and timers, they run sequentially, not in parallel.
https://github.com/google/benchmark/blob/2279f2acc8f2ca1bb51195a12966411bed31b26b/src/benchmark_runner.cc#L150-L159
I guess this was essentially "broken" by https://github.com/google/benchmark/pull/1836, except this breakage was intentional and long-desired.
So basically, if you want to model parallel data processing, you can't use the `->Threads()` interface; you need to do the forking internally.

One QOL improvement that is missing: adding an option to turn `->Threads()` into an information-only flag that would disable the in-library forking, so the benchmark could then do what it wishes with the `threads()` value it was provided.
Thanks for the pointers!
> That whole `->Threads()` thing is weird. It's essentially for modelling separate workers. For the purpose of counters and timers, they run sequentially, not in parallel.
I am not sure I am following. The mutex you linked synchronizes the updates of the global counter with the per-thread counter values, right? AFAICT the individual benchmark runs are concurrent. If the threads ran sequentially, there would be no point, right?
> I guess this was essentially "broken" by https://github.com/google/benchmark/pull/1836
Yes, that appears to be a regression. I guess the problem is that when aggregating the user counters, we pass `i.seconds` as the `cpu_time`, which is now the time spent by each thread, summed together:
https://github.com/google/benchmark/blob/2279f2acc8f2ca1bb51195a12966411bed31b26b/src/counter.cc#L25-L27
Now without reverting https://github.com/google/benchmark/pull/1836, which we obviously shouldn't do, can't we fix the user counters by multiplying with the number of threads for `kIsRate`, and dividing by the number of threads (which we do already) for `kAvgThreads`? Then rates would be global and rates per thread would be per thread as expected.
If you agree, I'm happy to submit a PR. cc @dmah42
> I guess the problem is that when aggregating the user counters, we pass `i.seconds` as the `cpu_time`, which is now the total time passed by each thread, summed together:

Look at the diff that was linked/shown directly after that statement, i.e. lines 25 to 27 of `src/counter.cc` at 2279f2a:

```c++
if ((c.flags & Counter::kIsRate) != 0) {
  v /= cpu_time;
}
```
> Now without reverting https://github.com/google/benchmark/pull/1836, which we obviously shouldn't do, can't we fix the user counters by multiplying with the number of threads for `kIsRate`, and dividing by the number of threads (which we do already) for `kAvgThreads`? Then rates would be global and rates per thread would be per thread as expected.

Here's what I mean: https://github.com/google/benchmark/pull/2081
Let's look at different scenarios with threads.

Scenario 1: both threads are fully parallel.

```
|----------|
|----------|
```

Walltime and [CPU] time for each is 1 s, and each produced 1 counter unit. Total walltime and [CPU] time is 2 s, producing 2 counter units. Currently, we'll say that the rate is 2 unit / 2 sec = 1 unit/s, which is correct for each, i.e. the average, thread here. With this patch, we'll say that the rate is 2 unit / (2 sec / 2 threads) = 2 unit / (1 sec/thread) = 2 unit*thread/sec, which is correct as the collective all-thread performance. (FIXME: units do not look correct there.)
Scenario 2: all threads are sequential.

```
|----------00000000000000|
|0000000----------0000000|
|00000000000000----------|
```

Walltime for each is 3 s, CPU time for each is 1 s, and each produced 1 counter unit. If we use CPU time as the time metric, this is identical to Scenario 1, so let's look at walltime. Total walltime is 9 s, producing 3 counter units. Currently, we'll say that the rate is 3 unit / 9 sec = (1/3) unit/s, which is correct for each, i.e. the average, thread here. With this patch, we'll say that the rate is 3 unit / (9 sec / 3 threads) = 3 unit / (3 sec/thread) = 1 unit*thread/sec, which is correct as the collective all-thread performance. (FIXME: units do not look correct there.)
Am I correct so far?
So essentially you are proposing to output the collective thread rate instead of the average thread rate. The problem is, my understanding of the `Threads()` feature is that it is mostly used to measure how well the code-under-benchmark scales with the number of threads. Currently, doubling the number of threads in either of these scenarios would not change the rate, so one can directly visually compare the runs; with this patch, the rate would scale linearly with the number of threads. Of course, there is `kAvgThreads`, but the existing code will be broken.
@dmah42 is my logic correct here?
> So essentially you are proposing to output, instead of average thread rate, the collective thread rate.
Exactly. That's what I would expect based on the documentation already, and from my understanding this is how it worked before https://github.com/google/benchmark/pull/1836, right? Here are the relevant docs:
```c++
// Set the counter as a rate. It will be presented divided
// by the duration of the benchmark.
// Meaning: per one second, how many 'foo's are processed?
state.counters["FooRate"] = Counter(numFoos, benchmark::Counter::kIsRate);

// Set the counter as a thread-average quantity. It will
// be presented divided by the number of threads.
state.counters["FooAvg"] = Counter(numFoos, benchmark::Counter::kAvgThreads);

// There's also a combined flag:
state.counters["FooAvgRate"] = Counter(numFoos, benchmark::Counter::kAvgThreadsRate);
```
> The problem is, my understanding of that Threads() feature is, it is mostly used to measure how well the code-under-benchmark scales with the number of threads, and currently, doubling the number of threads, in both of these scenarios, would not change the rate, and thus one can directly visually compare them, but with this patch, it would linearly scale with number of threads. Of course, there is kAvgThreads, but the existing code will be broken.
It depends on what you want to measure: per-thread throughput or global throughput. I would argue that existing code is already broken, because `kAvgThreadsRate` is incorrect (it divides by the number of threads twice, so the result doesn't mean anything) and `kIsRate` works differently than before. To me, this is a bug that can be fixed without introducing a new configuration. WDYT?
> but with this patch, it would linearly scale with number of threads
If your algorithm scales linearly, yes 😉
I guess i'm just mainly saddened that we are breaking the behaviour all the time.
Alright. So @LebedevRI @dmah42 once you both agree that my proposal in https://github.com/google/benchmark/pull/2081 makes sense, I'll look into getting the tests adjusted and preparing the PR for review.
After sleeping on it, i do suppose that proposed behaviour may be better, yes.
i think this is a positive change. having said that, i think in a year or two someone will come along with a convincing argument why this isn't the right thing to do and we should do X where X is slightly different, as i don't think there's one "right answer". so if we all agree that this PR is the right answer (now) then can we please also document this behaviour thoroughly in markdown in the docs area so we can point any future well-meaning PRs at that document.
> so if we all agree that this PR is the right answer (now) then can we please also document this behaviour thoroughly in markdown in the docs area so we can point any future well-meaning PRs at that document

Will do!