benchmark [BUG] (or not?) A weird behavior of DoNotOptimize on Mac M1

Describe the bug: I'm not sure whether this is normal behavior, a clang problem, a benchmark problem, a Mac problem, or a user error. Two simple benchmarks:

#include <benchmark/benchmark.h>
#include <cstdint>

void BM_sum32(benchmark::State& state) {
    for (auto _ : state) {
        uint32_t sum = 0;
        for(int i = 0; i < state.range(0); ++i) {
            sum += i;
        }
        benchmark::DoNotOptimize(sum);
    }
    state.SetItemsProcessed(state.iterations() * state.range(0));
}
BENCHMARK(BM_sum32)->RangeMultiplier(8)->Range(8, 8<<8);

void BM_sum64(benchmark::State& state) {
    for (auto _ : state) {
        uint64_t sum = 0;
        for(int i = 0; i < state.range(0); ++i) {
            sum += i;
        }
        benchmark::DoNotOptimize(sum);
    }
    state.SetItemsProcessed(state.iterations() * state.range(0));
}
BENCHMARK(BM_sum64)->RangeMultiplier(8)->Range(8, 8<<8);

The only difference is uint32_t vs uint64_t.

BM_sum32/8         0.555 ns        0.555 ns   1000000000 items_per_second=14.4186G/s
BM_sum32/64        0.553 ns        0.553 ns   1000000000 items_per_second=115.741G/s
BM_sum32/512       0.555 ns        0.554 ns   1000000000 items_per_second=923.448G/s
BM_sum32/2048      0.552 ns        0.552 ns   1000000000 items_per_second=3.7121T/s

BM_sum64/8          3.12 ns         3.12 ns    223279077 items_per_second=2.56145G/s
BM_sum64/64         20.6 ns         20.6 ns     33865342 items_per_second=3.10433G/s
BM_sum64/512         169 ns          169 ns      4135943 items_per_second=3.02456G/s
BM_sum64/2048        650 ns          650 ns      1078898 items_per_second=3.15298G/s

In a release build, the sum32 loop seems to be "optimized" (removed), while sum64 is fine.
In a debug build, both benchmarks are fine.
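
As a quick sanity check on the figures above, the reported items_per_second is the range divided by the time per iteration. The sketch below is plain C++ with made-up helper names (`items_per_second` and `time_ratio` are not part of the benchmark library) and simply reproduces that arithmetic. A per-iteration time that stays near 0.55 ns from 8 to 2048 items means the work no longer scales with the range, which is what a removed loop looks like.

```cpp
#include <cstdint>

// Hypothetical helper (not a benchmark API): throughput implied by one
// benchmark iteration that processes `items` elements in `ns_per_iter`
// nanoseconds.
double items_per_second(std::int64_t items, double ns_per_iter) {
    return static_cast<double>(items) / (ns_per_iter * 1e-9);
}

// Hypothetical helper: ratio of per-iteration times between two ranges.
// A ratio close to 1 while the range grows 256x suggests the loop body is
// not being executed once per element.
double time_ratio(double ns_small_range, double ns_large_range) {
    return ns_large_range / ns_small_range;
}
```

With the table's values, items_per_second(2048, 0.552) is about 3.71e12, matching the BM_sum32/2048 row, and time_ratio(0.555, 0.552) is about 1. For BM_sum64, time_ratio(3.12, 650) is about 208 for a 256x larger range, so that loop does scale with its input.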

System Which OS, compiler, and compiler version are you using:

Darwin Air-de-ker 20.6.0 Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101 arm64

clang --version
Apple clang version 12.0.5 (clang-1205.0.22.9)
Target: arm64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Expected behavior: The loop shouldn't be optimized out in BM_sum32?

Edit: As a side note, if I move "benchmark::DoNotOptimize(sum);" inside the for loop, both loops work in release mode, but then the performance drops dramatically from 3G/s to 500M/s. That is consistent with the result I get from this code, which basically benchmarks DoNotOptimize() itself (also ~500M/s):

void BM_loop(benchmark::State& state) {
    for (auto _ : state) {
        for(int i = 0; i < state.range(0); ++i)
            benchmark::DoNotOptimize(i);
    }
    state.SetItemsProcessed(state.iterations() * state.range(0));
}

Sep 29 '21 17:09 ker2x

Well... This works. I don't understand why, but I guess it means there is no problem with DoNotOptimize(). I just changed the loop index from int to uint.

void BM_sum32(benchmark::State& state) {
    for (auto _ : state) {
        uint32_t sum = 0;
        for(uint i = 0; i < state.range(0); ++i) {
            sum += i;
        }
        benchmark::DoNotOptimize(sum);
    }
    state.SetItemsProcessed(state.iterations() * state.range(0));
    state.SetBytesProcessed(state.iterations() * state.range(0) * sizeof(uint32_t));
}

But it breaks again if I use uint64_t as the index...

Sep 29 '21 18:09 ker2x

I simplified the problem. It confuses me.

Fails:

    for (auto _ : state) {
        uint32_t sum = 0;
        uint32_t size = state.range(0);
        for(uint32_t i = 0; i < size; ++i) {
            sum++;
        }
        benchmark::DoNotOptimize(sum);
    }

Works:

    for (auto _ : state) {
        uint32_t sum = 0;
        uint64_t size = state.range(0);
        for(uint32_t i = 0; i < size; ++i) {
            sum++;
        }
        benchmark::DoNotOptimize(sum);
    }

Fails:

    for (auto _ : state) {
        uint64_t sum = 0;
        uint64_t size = state.range(0);
        for(uint64_t i = 0; i < size; ++i) {
            sum++;
        }
        benchmark::DoNotOptimize(sum);
    }

Works:

    for (auto _ : state) {
        uint64_t sum = 0;
        uint64_t size = state.range(0);
        for(uint32_t i = 0; i < size; ++i) {
            sum++;
        }
        benchmark::DoNotOptimize(sum);
    }

32/32 fails, 64/64 fails.
32/64 works, 64/32 works.
I even considered that DoNotOptimize serves no purpose outside the loop and that the loop isn't optimized out only because of the different data types. But if I remove it, everything runs in 0 ns.

Sep 29 '21 19:09 ker2x

When you say "fail", I assume you mean "the loop is optimized out".

Checking the disassembly: your first example does indeed optimize out the loop.

The second doesn't, but I imagine that's because there's a potential side effect of the cast from uint64_t to uint32_t which can't be ignored by the compiler (total guess).
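
A possible reading of that guess, offered as an assumption rather than a confirmed diagnosis: when the index is narrower than the bound, the compiler cannot easily prove how many times the loop runs, because a 32-bit unsigned index wraps to 0 before it can reach a bound above 2^32 - 1. Without a known trip count it may leave the loop as written. The sketch below is plain C++ with function names made up for the example. It only demonstrates the wraparound and the widened comparison, not what clang actually does.

```cpp
#include <cstdint>
#include <limits>

// Value of a 32-bit unsigned index after incrementing it once from its
// maximum: unsigned arithmetic wraps, so the result is 0.
std::uint32_t next_after_max() {
    std::uint32_t i = std::numeric_limits<std::uint32_t>::max();
    ++i;
    return i;
}

// The loop condition `i < size` with a 32-bit index and a 64-bit bound.
// The index is widened to 64 bits for the comparison, so for any bound
// above 2^32 - 1 the condition holds for every possible index value.
bool condition_holds(std::uint32_t i, std::uint64_t size) {
    return i < size;
}
```

When the index and the bound have the same width, the condition eventually becomes false for every bound, so the number of iterations is simply the bound. That matches the 32/32 and 64/64 cases being the ones that collapse.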

The root problem, I think, is that the DoNotOptimize method isn't working on M1 Macs.

Sep 30 '21 09:09 dmah42

Yes, by "fail" I mean "optimized out".

Sep 30 '21 17:09 ker2x

DoNotOptimize is (likely) working fine. The issue is that the sum is being computed with a straight formula N*(N-1)/2 instead of a loop.
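
To illustrate the point, here is a minimal sketch, in plain C++ without the benchmark library, of a simplified version of the BM_sum32 loop next to the closed form a compiler may substitute for it. The function names are made up for this example, and the code clang actually emits may differ. The sketch only shows that the two give the same uint32_t result, including wraparound, which is why replacing one with the other is a legal optimization.

```cpp
#include <cstdint>

// Simplified version of the BM_sum32 loop: adds 0, 1, ..., n-1 with
// wraparound modulo 2^32.
std::uint32_t sum_loop(std::uint32_t n) {
    std::uint32_t sum = 0;
    for (std::uint32_t i = 0; i < n; ++i) {
        sum += i;
    }
    return sum;
}

// Closed form N*(N-1)/2. The product is formed in 64 bits so that the
// division by 2 happens before truncating to 32 bits. For n == 0 the
// factor (wide - 1) wraps, but the product is still 0.
std::uint32_t sum_closed_form(std::uint32_t n) {
    const std::uint64_t wide = n;
    return static_cast<std::uint32_t>(wide * (wide - 1) / 2);
}
```

Since the closed form costs the same for any N, a benchmark whose loop has been rewritten this way reports a constant time per iteration regardless of the range.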

You can see on Godbolt that even on x86-64 the first two loops, which use benchmark::DoNotOptimize(), are collapsed into a closed-form formula that takes a couple of cycles to compute, while the second batch, marked with BENCHMARK_DONT_OPTIMIZE, keeps the loop.
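
One way to keep the additions in the generated code is sketched here under assumptions. The helper `opaque` is a made-up name, relies on GCC/Clang extended inline assembly, and is neither the library's DoNotOptimize nor the BENCHMARK_DONT_OPTIMIZE macro from the linked example. It tells the compiler that the accumulator may have changed after every addition, which rules out the closed-form rewrite. As noted earlier in the thread, placing a barrier inside the loop also restricts other optimizations, so the measured throughput can drop.

```cpp
#include <cstdint>

// Made-up helper for this sketch (GCC/Clang only): an empty asm statement
// that claims to read and modify `value` in a register, so the compiler
// must materialize it here and cannot reason about it across this point.
inline void opaque(std::uint32_t& value) {
    __asm__ __volatile__("" : "+r"(value));
}

// Sum of 0, 1, ..., n-1 in uint32_t. The accumulator passes through the
// barrier on every iteration, so each addition has to be performed.
std::uint32_t sum_loop_kept(std::uint32_t n) {
    std::uint32_t sum = 0;
    for (std::uint32_t i = 0; i < n; ++i) {
        sum += i;
        opaque(sum);
    }
    return sum;
}
```

The result is unchanged (for example, sum_loop_kept(8) is 28). Only the compiler's freedom to skip the work is removed, so a timing of this function measures the loop plus the cost of the barrier.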

Mar 02 '23 08:03 HFTrader
