[BUG] Certain performance counters are not measured and are always zero
**Describe the bug**
Certain counters, such as `frontend_retired.l2_miss`, are always reported as zero by the performance counters integration with libpfm.

To confirm that this is indeed a bug, I made a small example that flushes a function from all caches using the `clflush` instruction and then calls it, which should lead to an L2 miss. `perf stat` and `perf record` confirm that `frontend_retired.l2_miss` should be non-zero and that the miss happens inside the function; however, the counters reported by google benchmark are always zero. I have also noticed this issue with a few other events, such as `frontend_retired.latency_ge_1`.
**System**
Which OS, compiler, and compiler version are you using:
- OS: Ubuntu 22.04, Kernel 6.8.0-49-generic
- Compiler and version: GNU assembler version 2.38 (x86_64-linux-gnu) using BFD version (GNU Binutils for Ubuntu) 2.38, g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
- Processor: 11th Gen Intel(R) Core(TM) i5-11400H
**To reproduce**
Make a file `fun.S` with the following contents:

```asm
.section .data
.section .text
.global fun
.global not_fun
.align 512
not_fun:
    nop
    ret
.align 512
fun:
    clflush not_fun
    mfence
    call not_fun
    ret
```
The function `fun` flushes `not_fun` from all the caches and then calls it, which causes a cache miss when the CPU fetches the first instruction of `not_fun`.
Make two more files, `prog1.S` and `prog2.cpp`, with the following contents.

`prog1.S`:

```asm
.section .data
.section .text
.global _start
_start:
    mov $10000000, %rcx
loop:
    call fun
    dec %rcx
    jnz loop
    xor %rdi, %rdi
    mov $60, %rax
    syscall
```

`prog2.cpp`:

```cpp
#include <benchmark/benchmark.h>

extern "C" void fun();

void BM(benchmark::State& state) {
  for (auto _ : state) {
    fun();
  }
}
BENCHMARK(BM);
BENCHMARK_MAIN();
```
and a `Makefile` (note that `prog2.o` needs `prog2.cpp` as a prerequisite):

```makefile
all: fun.o prog1.o prog2.o
	g++ prog1.o fun.o -o prog1.out -std=c++17 -nostdlib -no-pie -O3 -Wall -Werror -fno-omit-frame-pointer -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread -lpfm
	g++ prog2.o fun.o -o prog2.out -std=c++17 -no-pie -O3 -Wall -Werror -fno-omit-frame-pointer -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread -lpfm

prog1.o: prog1.S
	as prog1.S -o prog1.o

fun.o: fun.S
	as fun.S -o fun.o

prog2.o: prog2.cpp
	g++ prog2.cpp -c -std=c++17 -no-pie -O3 -Wall -Werror -fno-omit-frame-pointer -isystem benchmark/include

clean:
	rm -f *.o *.out
```
First, I run `prog1.out` under perf to get the expected results without google benchmark's code involved. `prog1.S` is simply a loop that calls `fun` 1e7 times.

```
$ sudo perf stat -e cycles,frontend_retired.l1i_miss,frontend_retired.l2_miss,frontend_retired.latency_ge_1 taskset -c 0 nice -n -20 ./prog1.out

 Performance counter stats for 'taskset -c 0 nice -n -20 ./prog1.out':

    6,71,13,49,717      cycles                           (49.99%)
         14,09,768      frontend_retired.l1i_miss        (50.04%)
         13,26,441      frontend_retired.l2_miss         (24.98%)
       3,53,09,157      frontend_retired.latency_ge_1    (24.98%)

       1.517582000 seconds time elapsed

       1.515770000 seconds user
       0.000999000 seconds sys
```
As expected, we see L2 misses. Using `perf record`/sampling, it can be confirmed that all the misses occur inside `fun`.
Now I run `prog2.out` to see what google benchmark reports for the counters.
```
$ sudo taskset -c 0 nice -n -20 ./prog2.out --benchmark_perf_counters=CYCLES,frontend_retired.l1i_miss,frontend_retired.l2_miss,frontend_retired.latency_ge_1 --benchmark_counters_tabular=true --benchmark_min_time=10000000x
Running ./prog2.out
Run on (12 X 4279.92 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.18, 0.38, 0.49
-------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark        Time             CPU   Iterations   CYCLES   frontend_retired.l1i_miss   frontend_retired.l2_miss   frontend_retired.latency_ge_1
-------------------------------------------------------------------------------------------------------------------------------------------------
BM             158 ns          158 ns     10000000  702.339                    0.342522                          0                               0
```
As shown, the reported L2 misses are zero.
For one final confirmation, we run google benchmark under `perf stat`. I'm not sure whether both programs trying to use the same counters interferes with anything, but google benchmark again reports the counter values as zero whereas `perf stat` gives non-zero values. It is unlikely that these non-zero values are caused by google benchmark's code rather than the test code, since I have shown above that the test code on its own triggers the event, as confirmed with `perf stat` + `perf record`.
```
$ sudo perf stat -e cycles,frontend_retired.l1i_miss,frontend_retired.l2_miss,frontend_retired.latency_ge_1 taskset -c 0 nice -n -20 ./prog2.out --benchmark_perf_counters=CYCLES,frontend_retired.l1i_miss,frontend_retired.l2_miss,frontend_retired.latency_ge_1 --benchmark_counters_tabular=true --benchmark_min_time=10000000x
Running ./prog2.out
Run on (12 X 4184.45 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.41, 0.42, 0.47
-------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark        Time             CPU   Iterations   CYCLES   frontend_retired.l1i_miss   frontend_retired.l2_miss   frontend_retired.latency_ge_1
-------------------------------------------------------------------------------------------------------------------------------------------------
BM             158 ns          158 ns     10000000  699.904                     0.28635                          0                               0

 Performance counter stats for 'taskset -c 0 nice -n -20 ./prog2.out --benchmark_perf_counters=CYCLES,frontend_retired.l1i_miss,frontend_retired.l2_miss,frontend_retired.latency_ge_1 --benchmark_counters_tabular=true --benchmark_min_time=10000000x':

    7,03,84,92,871      cycles                           (16.85%)
         41,63,656      frontend_retired.l1i_miss        (16.84%)
         10,25,334      frontend_retired.l2_miss         (0.13%)
      26,30,55,471      frontend_retired.latency_ge_1    (0.13%)

       1.592229730 seconds time elapsed

       1.586011000 seconds user
       0.006000000 seconds sys
```
**Expected behavior**
The counter values should be non-zero when it is clear that L2 cache misses are happening (confirmed with perf).
can you identify what in the benchmark code might be affecting this? we use a library for perf counting and, other than the configuration we pass, it's not obvious to me what could cause such a bug unless it's in the perf counting library itself.
@dmah42 actually I forgot to post an update to the issue, but a few days ago I realized that the issue is not with the specific counter, but rather with the order of the arguments I pass in.

As in the issue description, when I pass `...,frontend_retired.l1i_miss,frontend_retired.l2_miss,...` the L2 misses are zero, whereas if I reverse the order of these two events, the L2 misses become non-zero and the L1i misses become zero instead. Maybe the events are not being multiplexed properly, or there is some issue in the configuration being passed; I'm not sure.
i suspect something on the reporting side rather than the configuration side. i can't see anything obviously wrong in the perf counters `Stop` method where we write the counter values, or in how those counters are reported.

out of curiosity, if you request JSON output instead, do you get the results you expect?
No, the same issue exists with JSON output.
@mtrofin have you seen any issues like this?
No. If you're passing three or fewer counter names, there should be no multiplexing either.
Libpfm translates string counter names to their underlying event IDs. To try to isolate the problem, could you fetch the raw IDs from google/benchmark and feed those to perf?