benchmark
[BUG] (or not?) A weird behavior of DoNotOptimize on Mac M1
Describe the bug I'm not really sure whether this is normal behavior, a clang problem, a benchmark problem, a Mac problem, or user error. Two simple benchmarks:
void BM_sum32(benchmark::State& state) {
  for (auto _ : state) {
    uint32_t sum = 0;
    for (int i = 0; i < state.range(0); ++i) {
      sum += i;
    }
    benchmark::DoNotOptimize(sum);
  }
  state.SetItemsProcessed(state.iterations() * state.range(0));
}
BENCHMARK(BM_sum32)->RangeMultiplier(8)->Range(8, 8<<8);
void BM_sum64(benchmark::State& state) {
  for (auto _ : state) {
    uint64_t sum = 0;
    for (int i = 0; i < state.range(0); ++i) {
      sum += i;
    }
    benchmark::DoNotOptimize(sum);
  }
  state.SetItemsProcessed(state.iterations() * state.range(0));
}
BENCHMARK(BM_sum64)->RangeMultiplier(8)->Range(8, 8<<8);
The only difference is uint32_t vs uint64_t.
BM_sum32/8 0.555 ns 0.555 ns 1000000000 items_per_second=14.4186G/s
BM_sum32/64 0.553 ns 0.553 ns 1000000000 items_per_second=115.741G/s
BM_sum32/512 0.555 ns 0.554 ns 1000000000 items_per_second=923.448G/s
BM_sum32/2048 0.552 ns 0.552 ns 1000000000 items_per_second=3.7121T/s
BM_sum64/8 3.12 ns 3.12 ns 223279077 items_per_second=2.56145G/s
BM_sum64/64 20.6 ns 20.6 ns 33865342 items_per_second=3.10433G/s
BM_sum64/512 169 ns 169 ns 4135943 items_per_second=3.02456G/s
BM_sum64/2048 650 ns 650 ns 1078898 items_per_second=3.15298G/s
- In a release build, the BM_sum32 loop appears to be "optimized" (removed entirely), while BM_sum64 behaves as expected.
- In a debug build, both benchmarks behave correctly.
System Which OS, compiler, and compiler version are you using:
Darwin Air-de-ker 20.6.0 Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101 arm64
clang --version
Apple clang version 12.0.5 (clang-1205.0.22.9)
Target: arm64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
Expected behavior The loop shouldn't be optimized out in BM_sum32.
edit As a side note: if I move benchmark::DoNotOptimize(sum); inside the for-loop, both loops work in release mode, but performance drops dramatically, from ~3G/s to ~500M/s. That is consistent with the result I get from this code (which essentially benchmarks DoNotOptimize() itself: also ~500M/s):
void BM_loop(benchmark::State& state) {
  for (auto _ : state) {
    for (int i = 0; i < state.range(0); ++i)
      benchmark::DoNotOptimize(i);
  }
  state.SetItemsProcessed(state.iterations() * state.range(0));
}
Well... this works, and I don't understand why, but I guess it means there is no problem with DoNotOptimize(). I just changed the loop index from int to uint:
void BM_sum32(benchmark::State& state) {
  for (auto _ : state) {
    uint32_t sum = 0;
    for (uint i = 0; i < state.range(0); ++i) {
      sum += i;
    }
    benchmark::DoNotOptimize(sum);
  }
  state.SetItemsProcessed(state.iterations() * state.range(0));
  state.SetBytesProcessed(state.iterations() * state.range(0) * sizeof(uint32_t));
}
but it breaks again if I use uint64_t as the index ...
I simplified the problem; it confuses me.
fails:
for (auto _ : state) {
  uint32_t sum = 0;
  uint32_t size = state.range(0);
  for (uint32_t i = 0; i < size; ++i) {
    sum++;
  }
  benchmark::DoNotOptimize(sum);
}
works:
for (auto _ : state) {
  uint32_t sum = 0;
  uint64_t size = state.range(0);
  for (uint32_t i = 0; i < size; ++i) {
    sum++;
  }
  benchmark::DoNotOptimize(sum);
}
fails:
for (auto _ : state) {
  uint64_t sum = 0;
  uint64_t size = state.range(0);
  for (uint64_t i = 0; i < size; ++i) {
    sum++;
  }
  benchmark::DoNotOptimize(sum);
}
works:
for (auto _ : state) {
  uint64_t sum = 0;
  uint64_t size = state.range(0);
  for (uint32_t i = 0; i < size; ++i) {
    sum++;
  }
  benchmark::DoNotOptimize(sum);
}
- 32/32 fails, 64/64 fails
- 32/64 works, 64/32 works
- I even considered that DoNotOptimize serves no purpose outside the loop, and that the mixed-type loops survive only because of the differing data types. But if I remove it, everything runs in 0 ns.
When you say "fail" I assume you mean "the loop is optimized out".
Checking the disassembly: your first example does indeed optimize out the loop. The second doesn't, but I imagine that's because there's a potential side effect of the cast from uint64_t to uint32_t which can't be ignored by the compiler (total guess).
I think the root problem is that the DoNotOptimize method isn't working on M1 Macs.
Yes, by "fail" I mean "optimized out".
DoNotOptimize is (likely) working fine. The issue is that the sum is being computed with the closed-form formula N*(N-1)/2 instead of a loop.
You can see on Godbolt that even on x86-64 the first two loops using benchmark::DoNotOptimize() are collapsed into that formula, which takes only a couple of cycles to compute, while the second batch marked with BENCHMARK_DONT_OPTIMIZE keeps the loop.