[BUG] Rate counters are per-thread in multi-threaded benchmarks, kAvgThreadsRate does not make sense
**Describe the bug**

When using counters that represent a rate (`benchmark::Counter::kIsRate`), the rate appears to be computed per thread. Consequently, when using `kAvgThreadsRate`, the per-thread rate is then divided by the number of threads (again). This is misleading.
**System**

OS:

```
ProductName:    macOS
ProductVersion: 15.6.1
BuildVersion:   24G90
```

Compiler:

```
Apple clang version 17.0.0 (clang-1700.0.13.5)
Target: arm64-apple-darwin24.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
```
**Library**

```cmake
FetchContent_Declare(
  benchmark
  GIT_REPOSITORY https://github.com/google/benchmark.git
  GIT_TAG        v1.9.4
)
```
**To reproduce**

Write a benchmark that simply sleeps for 1 second and then increments a counter by 1. With 1 thread we expect a counter of 1 and a rate of 1/s. With 10 threads we expect a counter of 10 and a rate of 10/s. However, the rate is reported as 1/s and the per-thread rate as 0.1/s, which seems wrong.
```c++
#include <benchmark/benchmark.h>

#include <chrono>
#include <thread>

static void BM_IntegerAddition(benchmark::State& state) {
  for (auto _ : state) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    benchmark::DoNotOptimize(1 + 2);
  }
  state.counters["counter"] = benchmark::Counter(1);
  state.counters["counter_rate"] = benchmark::Counter(1, benchmark::Counter::kIsRate);
  state.counters["counter_thread_rate"] = benchmark::Counter(1, benchmark::Counter::kAvgThreadsRate);
}
BENCHMARK(BM_IntegerAddition)
    ->Threads(1)
    ->Threads(10)
    ->Iterations(1)
    ->UseRealTime();
```
```
---------------------------------------------------------------------------------------------------------------
Benchmark                                                    Time        CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------
BM_IntegerAddition/iterations:1/real_time/threads:1   1005048541 ns   29000 ns          1 counter=1 counter_rate=0.994977/s counter_thread_rate=0.994977/s
BM_IntegerAddition/iterations:1/real_time/threads:10  1002581584 ns   36300 ns         10 counter=10 counter_rate=0.997425/s counter_thread_rate=0.0997425/s
```
**Expected behavior**

```
---------------------------------------------------------------------------------------------------------------
Benchmark                                                    Time        CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------
BM_IntegerAddition/iterations:1/real_time/threads:1   1005048541 ns   29000 ns          1 counter=1 counter_rate=0.994977/s counter_thread_rate=0.994977/s
BM_IntegerAddition/iterations:1/real_time/threads:10  1002581584 ns   36300 ns         10 counter=10 counter_rate=9.97425/s counter_thread_rate=0.997425/s
```
The issue does not seem to mention it: which version of the library is this?
Sorry @LebedevRI! I updated the description: 1.9.4.
Ah yes, i've seen this before. That whole `->Threads()` thing is weird. It's essentially for modelling separate workers.
For the purpose of counters and timers, they run sequentially, not in parallel.
https://github.com/google/benchmark/blob/2279f2acc8f2ca1bb51195a12966411bed31b26b/src/benchmark_runner.cc#L150-L159
I guess this was essentially "broken" by https://github.com/google/benchmark/pull/1836, except this breakage was intentional and long-desired.
So basically, if you want to model parallel data processing, you can't use the `->Threads()` interface; you need to do the forking internally.

One QOL improvement that is missing: adding an option to turn `->Threads()` into an information-only flag that would disable the in-library forking, so the benchmark could then do what it wishes with the `threads()` value it was provided.
Thanks for the pointers!
> That whole `->Threads()` thing is weird. It's essentially for modelling separate workers. For the purpose of counters and timers, they run sequentially, not in parallel.
I am not sure I am following. The mutex you linked synchronizes the updates of the global counter with the per-thread counter values, right? AFAICT the individual benchmark runs are concurrent. If the threads ran sequentially, there would be no point, right?
> I guess this was essentially "broken" by https://github.com/google/benchmark/pull/1836
Yes, that appears to be a regression. I guess the problem is that when aggregating the user counters, we pass `i.seconds` as the `cpu_time`, which is now the time spent by each thread, summed together:
https://github.com/google/benchmark/blob/2279f2acc8f2ca1bb51195a12966411bed31b26b/src/counter.cc#L25-L27
Now without reverting https://github.com/google/benchmark/pull/1836, which we obviously shouldn't do, can't we fix the user counters by multiplying with the number of threads for `kIsRate`, and dividing by the number of threads (which we do already) for `kAvgThreads`? Then rates would be global and rates per thread would be per thread as expected.
If you agree, I'm happy to submit a PR. cc @dmah42
> I guess the problem is that when aggregating the user counters, we pass `i.seconds` as the `cpu_time`, which is now the total time passed by each thread, summed together:

Look at the diff that was linked/shown directly after that statement, i.e. lines 25 to 27 of `src/counter.cc` at 2279f2a:

```c++
if ((c.flags & Counter::kIsRate) != 0) {
  v /= cpu_time;
}
```
> Now without reverting https://github.com/google/benchmark/pull/1836, which we obviously shouldn't do, can't we fix the user counters by multiplying with the number of threads for `kIsRate`, and dividing by the number of threads (which we do already) for `kAvgThreads`? Then rates would be global and rates per thread would be per thread as expected.

Here's what I mean: https://github.com/google/benchmark/pull/2081
Let's look at different scenarios with threads.

Scenario 1: both threads are fully parallel.

```
|----------|
|----------|
```

Walltime and [CPU] time for each is 1 s, and each produced 1 counter unit. Total walltime and [CPU] time is 2 s, producing 2 counter units. Currently, we'll say that the rate is 2 unit / 2 sec = 1 unit/s, which is correct for each, i.e. the average, thread here. With this patch, we'll say that the rate is 2 unit / (2 sec / 2 threads) = 2 unit / (1 sec/thread) = 2 unit*thread/sec, which is correct as the collective all-thread performance. (FIXME: units do not look correct there.)
Scenario 2: all threads are sequential.

```
|----------00000000000000|
|0000000----------0000000|
|00000000000000----------|
```

Walltime for each is 3 s, CPU time for each is 1 s, and each produced 1 counter unit. If we use CPU time as the time metric, this is identical to Scenario 1, so let's look at walltime. Total walltime is 9 s, producing 3 counter units. Currently, we'll say that the rate is 3 unit / 9 sec = (1/3) unit/s, which is correct for each, i.e. the average, thread here. With this patch, we'll say that the rate is 3 unit / (9 sec / 3 threads) = 3 unit / (3 sec/thread) = 1 unit*thread/sec, which is correct as the collective all-thread performance. (FIXME: units do not look correct there.)
Am I correct so far?
So essentially you are proposing to output the collective thread rate instead of the average thread rate. The problem is, my understanding of the `Threads()` feature is that it is mostly used to measure how well the code-under-benchmark scales with the number of threads. Currently, doubling the number of threads in either of these scenarios would not change the rate, so one can directly visually compare the runs; with this patch, the rate would scale linearly with the number of threads. Of course, there is `kAvgThreads`, but the existing code will be broken.
@dmah42 is my logic correct here?
> So essentially you are proposing to output, instead of average thread rate, the collective thread rate.
Exactly. That's what I would expect based on the documentation already, and from my understanding this is how it worked before https://github.com/google/benchmark/pull/1836, right? Here are the relevant docs:
```c++
// Set the counter as a rate. It will be presented divided
// by the duration of the benchmark.
// Meaning: per one second, how many 'foo's are processed?
state.counters["FooRate"] = Counter(numFoos, benchmark::Counter::kIsRate);

// Set the counter as a thread-average quantity. It will
// be presented divided by the number of threads.
state.counters["FooAvg"] = Counter(numFoos, benchmark::Counter::kAvgThreads);

// There's also a combined flag:
state.counters["FooAvgRate"] = Counter(numFoos, benchmark::Counter::kAvgThreadsRate);
```
> The problem is, my understanding of that Threads() feature is, it is mostly used to measure how well the code-under-benchmark scales with the number of threads, and currently, doubling the number of threads, in both of these scenarios, would not change the rate, and thus one can directly visually compare them, but with this patch, it would linearly scale with number of threads. Of course, there is kAvgThreads, but the existing code will be broken.
It depends on what you want to measure: per-thread throughput or global throughput. I would argue that existing code is already broken, because `kAvgThreadsRate` is incorrect (it divides by the number of threads twice, so the result doesn't mean anything) and `kIsRate` works differently than before. To me, this is a bug that can be fixed without introducing a new configuration. WDYT?
> but with this patch, it would linearly scale with number of threads
If your algorithm scales linearly, yes 😉
I guess i'm just mainly saddened that we are breaking the behaviour all the time.
Alright. So @LebedevRI @dmah42 once you both agree that my proposal in https://github.com/google/benchmark/pull/2081 makes sense, I'll look into getting the tests adjusted and preparing the PR for review.
After sleeping on it, i do suppose that proposed behaviour may be better, yes.
i think this is a positive change. having said that, i think in a year or two someone will come along with a convincing argument why this isn't the right thing to do and we should do X where X is slightly different, as i don't think there's one "right answer". so if we all agree that this PR is the right answer (now) then can we please also document this behaviour thoroughly in markdown in the docs area so we can point any future well-meaning PRs at that document.
> so if we all agree that this PR is the right answer (now) then can we please also document this behaviour thoroughly in markdown in the docs area so we can point any future well-meaning PRs at that document

Will do!