[BUG] (or not?) A weird behavior of DoNotOptimize on Mac M1

Open ker2x opened this issue 3 years ago • 5 comments

Describe the bug: I'm not sure whether this is normal behavior, a clang problem, a benchmark problem, a Mac problem, or a user error. Here are two simple benchmarks:

void BM_sum32(benchmark::State& state) {
    for (auto _ : state) {
        uint32_t sum = 0;
        for(int i = 0; i < state.range(0); ++i) {
            sum += i;
        }
        benchmark::DoNotOptimize(sum);
    }
    state.SetItemsProcessed(state.iterations() * state.range(0));
}
BENCHMARK(BM_sum32)->RangeMultiplier(8)->Range(8, 8<<8);

void BM_sum64(benchmark::State& state) {
    for (auto _ : state) {
        uint64_t sum = 0;
        for(int i = 0; i < state.range(0); ++i) {
            sum += i;
        }
        benchmark::DoNotOptimize(sum);
    }
    state.SetItemsProcessed(state.iterations() * state.range(0));
}
BENCHMARK(BM_sum64)->RangeMultiplier(8)->Range(8, 8<<8);

The only difference is uint32_t vs uint64_t.

BM_sum32/8         0.555 ns        0.555 ns   1000000000 items_per_second=14.4186G/s
BM_sum32/64        0.553 ns        0.553 ns   1000000000 items_per_second=115.741G/s
BM_sum32/512       0.555 ns        0.554 ns   1000000000 items_per_second=923.448G/s
BM_sum32/2048      0.552 ns        0.552 ns   1000000000 items_per_second=3.7121T/s

BM_sum64/8          3.12 ns         3.12 ns    223279077 items_per_second=2.56145G/s
BM_sum64/64         20.6 ns         20.6 ns     33865342 items_per_second=3.10433G/s
BM_sum64/512         169 ns          169 ns      4135943 items_per_second=3.02456G/s
BM_sum64/2048        650 ns          650 ns      1078898 items_per_second=3.15298G/s
  • In a release build, the sum32 loop seems to be "optimized" (removed), while sum64 is fine.
  • In a debug build, both benchmarks are fine.

System: Which OS, compiler, and compiler version are you using?

Darwin Air-de-ker 20.6.0 Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101 arm64

clang --version
Apple clang version 12.0.5 (clang-1205.0.22.9)
Target: arm64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Expected behavior: The loop shouldn't be optimized out in BM_sum32?

Edit: As a side note, if I move "benchmark::DoNotOptimize(sum);" inside the for loop, both loops work in release mode, but performance then drops dramatically, from 3G/s to 500M/s. That is consistent with the result I get from this code, which basically benchmarks DoNotOptimize() itself (also ~500M/s):

void BM_loop(benchmark::State& state) {
    for (auto _ : state) {
        for(int i = 0; i < state.range(0); ++i)
            benchmark::DoNotOptimize(i);
    }
    state.SetItemsProcessed(state.iterations() * state.range(0));
}
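To make the "outside vs. inside the loop" difference concrete, here is a minimal sketch that does not use the benchmark library at all. It assumes GCC or Clang inline asm, and keep_value is a hypothetical stand-in for a DoNotOptimize-style barrier, not the library's actual code. With the barrier after the loop, only the final value of sum is opaque, so the compiler remains free to produce it any way it likes. With the barrier inside the loop, every partial sum has to exist, so every iteration has to run.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a DoNotOptimize-style barrier (GCC/Clang only):
// an empty asm statement that claims to read and modify `value`.
inline void keep_value(std::uint64_t& value) {
    asm volatile("" : "+r"(value) : : "memory");
}

// Barrier after the loop: only the final sum is hidden from the optimizer.
inline std::uint64_t sum_barrier_outside(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        sum += i;
    }
    keep_value(sum);
    return sum;
}

// Barrier inside the loop: each partial sum is hidden, so each step must run.
inline std::uint64_t sum_barrier_inside(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        sum += i;
        keep_value(sum);
    }
    return sum;
}

// Both variants return the same value; only the generated code may differ.
inline void check_same_result() {
    assert(sum_barrier_outside(8) == sum_barrier_inside(8));
}
```

Both functions compute the same number, so any timing difference between them comes from the work the barrier forces the compiler to keep, which matches the drop reported above when the call is moved inside the loop.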

ker2x, Sep 29 '21 17:09

Well... this works. I don't understand why, but I guess it means there is no problem with DoNotOptimize(). I just changed the loop index from int to uint.

void BM_sum32(benchmark::State& state) {
    for (auto _ : state) {
        uint32_t sum = 0;
        for(uint i = 0; i < state.range(0); ++i) {
            sum += i;
        }
        benchmark::DoNotOptimize(sum);
    }
    state.SetItemsProcessed(state.iterations() * state.range(0));
    state.SetBytesProcessed(state.iterations() * state.range(0) * sizeof(uint32_t));
}

But it breaks again if I use uint64_t as the index...

ker2x, Sep 29 '21 18:09

I simplified the problem. It confuses me.

Fails:

    for (auto _ : state) {
        uint32_t sum = 0;
        uint32_t size = state.range(0);
        for(uint32_t i = 0; i < size; ++i) {
            sum++;
        }
        benchmark::DoNotOptimize(sum);
    }

Works:

    for (auto _ : state) {
        uint32_t sum = 0;
        uint64_t size = state.range(0);
        for(uint32_t i = 0; i < size; ++i) {
            sum++;
        }
        benchmark::DoNotOptimize(sum);
    }
    

Fails:

    for (auto _ : state) {
        uint64_t sum = 0;
        uint64_t size = state.range(0);
        for(uint64_t i = 0; i < size; ++i) {
            sum++;
        }
        benchmark::DoNotOptimize(sum);
    }

Works:

    for (auto _ : state) {
        uint64_t sum = 0;
        uint64_t size = state.range(0);
        for(uint32_t i = 0; i < size; ++i) {
            sum++;
        }
        benchmark::DoNotOptimize(sum);
    }
    
  • 32/32 fails, 64/64 fails
  • 32/64 works, 64/32 works
  • I even considered that DoNotOptimize serves no purpose outside the loop, and that the loop survives only because of the differing data types. But if I remove it, everything runs in 0 ns.

ker2x, Sep 29 '21 19:09

When you say "fail", I assume you mean "the loop is optimized out".

Checking the disassembly: your first example does indeed optimize out the loop.

The second doesn't, but I imagine that's because the conversion between uint64_t and uint32_t has a potential side effect that the compiler can't ignore (total guess).
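One hedged way to picture that guess (not confirmed against the compiler's source): when a uint32_t index is compared against a uint64_t bound, the index wraps back to 0 after UINT32_MAX, so it can never reach a bound larger than that. The trip count is then not simply "size" for every possible input, which may be what keeps the loop from being folded away. The helper names below are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: can `for (uint32_t i = 0; i < size; ++i)` ever exit?
// The index tops out at UINT32_MAX, so it exits only if size <= UINT32_MAX.
inline bool loop_can_exit(std::uint64_t size) {
    return size <= std::numeric_limits<std::uint32_t>::max();
}

// Incrementing a uint32_t past its maximum wraps to 0 (well defined in C++).
inline std::uint32_t next_index(std::uint32_t i) {
    return static_cast<std::uint32_t>(i + 1u);
}

inline void check_wrap() {
    const std::uint32_t top = std::numeric_limits<std::uint32_t>::max();
    assert(next_index(top) == 0u);                   // index wraps instead of growing
    assert(loop_can_exit(2048));                     // the benchmark sizes are reachable
    assert(!loop_can_exit(std::uint64_t{1} << 32));  // this bound is out of reach
}
```

For the sizes used in this issue (8 to 2048) the loop terminates normally; the point is only that the compiler has to reason about all possible bounds, not just the ones the benchmark passes in.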

The root problem, I think, is that the DoNotOptimize method isn't working for M1 Macs.

dmah42, Sep 30 '21 09:09

yes, by "fail" I mean "optimized out"

ker2x, Sep 30 '21 17:09

DoNotOptimize is (likely) working fine. The issue is that the sum is being computed with a straight formula N*(N-1)/2 instead of a loop.

You can see on Godbolt that even on x86-64, the first two loops with benchmark::DoNotOptimize() are collapsed into a closed-form formula that takes a couple of cycles to compute, while the second batch, marked with BENCHMARK_DONT_OPTIMIZE, keeps the loop.
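A small sketch of that equivalence, written as plain functions rather than benchmarks. The names sum_by_loop and sum_by_formula are made up for illustration, and the formula assumes n*(n-1) does not overflow 64 bits, which holds for the sizes used in this issue. If the two agree for every n, an optimizer is entitled to replace the first with the second, and the benchmark then measures a constant-time computation regardless of the range.

```cpp
#include <cassert>
#include <cstdint>

// The loop as written in the benchmark body: adds 0, 1, ..., n-1.
inline std::uint64_t sum_by_loop(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        sum += i;
    }
    return sum;
}

// What an optimizer may emit instead: no loop, constant time.
// Assumes n * (n - 1) fits in 64 bits.
inline std::uint64_t sum_by_formula(std::uint64_t n) {
    return n * (n - 1) / 2;
}

// Compare both versions at the range values the benchmarks above use.
inline void check_formula() {
    const std::uint64_t sizes[] = {8, 64, 512, 2048};
    for (std::uint64_t n : sizes) {
        assert(sum_by_loop(n) == sum_by_formula(n));
    }
}
```

This is consistent with the flat ~0.55 ns timings for BM_sum32 above: the time does not grow with the range because no per-element work is left to measure.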

HFTrader, Mar 02 '23 08:03