benchmark icon indicating copy to clipboard operation
benchmark copied to clipboard

Different results, depending on wether certain benchmarks are enabled.

Open DenisYaroshevskiy opened this issue 7 years ago • 20 comments

Hi!

If I enable set_union_unbalanced benchmark, I get:

set_union_benchmark<std_set_union> 9301 ns 9238 ns 74616 set_union_benchmark<set_union_unbalanced> 9586 ns 9550 ns 76047 However, if it's disabled (commented out), I get:

set_union_benchmark<std_set_union> 3376 ns 3364 ns 207403

Results are consistent and depend a lot on partition_point_biased code. Seems like a bug. code: https://gist.github.com/DenisYaroshevskiy/604f1e331b384579b2a4fb5eac3ac8ef

System info: 2017-10-17 18 31 07

clang --version Apple LLVM version 9.0.0 (clang-900.0.38) Target: x86_64-apple-darwin16.7.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

DenisYaroshevskiy avatar Oct 17 '17 15:10 DenisYaroshevskiy

Upd: Seems like the issue goes away if I use BENCHMARK instead of BENCHMARK_TEMPLATE. Is this even possible?

DenisYaroshevskiy avatar Oct 17 '17 15:10 DenisYaroshevskiy

#define BENCHMARK(n)                                     \
  BENCHMARK_PRIVATE_DECLARE(n) =                         \
      (::benchmark::internal::RegisterBenchmarkInternal( \
          new ::benchmark::internal::FunctionBenchmark(#n, n)))
#define BENCHMARK_TEMPLATE1(n, a)                        \
  BENCHMARK_PRIVATE_DECLARE(n) =                         \
      (::benchmark::internal::RegisterBenchmarkInternal( \
          new ::benchmark::internal::FunctionBenchmark(#n "<" #a ">", n<a>)))

I don't think that replacing one with the other would make any difference.

can you put the files into a gist so it's easier to take a look at them?

dmah42 avatar Oct 17 '17 16:10 dmah42

That makes sence. Didn't reproduce since though.

Here is gist, I didn't know about it, don't use github that often: https://gist.github.com/DenisYaroshevskiy/c1c28e3fd95f8a7eb53ddab40dc8464f

DenisYaroshevskiy avatar Oct 17 '17 17:10 DenisYaroshevskiy

$ clang++ ./set_union_bench.cc -o set_union_bench -I ~/git/benchmark/include/ -L ~/git/benchmark/_build/src/ -lbenchmark -pthread --std=c++11
In file included from ./set_union_bench.cc:7:
./lib.h:209:10: error: chosen constructor is explicit in copy-initialization
  return {f1, f2, o};

am i missing some requirement?

dmah42 avatar Oct 17 '17 18:10 dmah42

This is mine compilation line. clang++ --std=c++14 -O3 -Werror -Wall $1 -I /space/google_benchmarks/include/ /space/google_benchmarks/build/src/libbenchmark.a set_union_bench.cc

Reproduced the bug again, no problem.

I use https://github.com/google/benchmark/releases/tag/v1.2.0 But previously it manifested itself on master too.

DenisYaroshevskiy avatar Oct 17 '17 18:10 DenisYaroshevskiy

I can answer faster in Telegram, if it helps: @dyaroshev

And the problem still goes away, if you do functions instead.

DenisYaroshevskiy avatar Oct 17 '17 18:10 DenisYaroshevskiy

You're using libc++ as well right?

I'm still trying to get your example to compile :)

dmah42 avatar Oct 17 '17 18:10 dmah42

Yes, libc++, out of the box, no hand waving! I even have the Wall on! Do you use c++14?

DenisYaroshevskiy avatar Oct 17 '17 18:10 DenisYaroshevskiy

Ok, seems like the older version of clang is the problem - I reproduced your compilation error on quick-bench.

Here you go, a fixed gist: https://gist.github.com/DenisYaroshevskiy/604f1e331b384579b2a4fb5eac3ac8ef

I even made it single file

The problem is still here.

DenisYaroshevskiy avatar Oct 17 '17 18:10 DenisYaroshevskiy

Hi!

I found a repro, that kinda works on quick-bench too. Here is the code: http://quick-bench.com/pFHnrUKpCRJMgTPBH6pcZImXYIw Enabling/Disabling CurrentSetUnion makes LinearSetUnion slower by a millisecond.

On my machine, things are much worse:

----------------------------------------------------------------
Benchmark                         Time           CPU Iterations
----------------------------------------------------------------
baseline/560/1440               152 ns        152 ns    4313718
LinearSetUnion/560/1440        4705 ns       4703 ns     149426
CurrentSetUnion/560/1440       3130 ns       3130 ns     222490

---------------------------------------------------------------
Benchmark                        Time           CPU Iterations
---------------------------------------------------------------
baseline/560/1440              153 ns        153 ns    4512810
LinearSetUnion/560/1440       1909 ns       1909 ns     370949

DenisYaroshevskiy avatar Nov 06 '17 00:11 DenisYaroshevskiy

I think I may have found the reason, but I haven't had a chance to test it yet. I figured instead of waiting longer to get the chance to test it, I could post my thoughts and maybe someone else can.

https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues

reminded me about issues I'd seen before with aligning functions on i-cache line boundaries. I wonder if this is what's happening. Further, I wonder if we should recommend somewhere forcing function level i-cache line alignment when compiling benchmarks.

dmah42 avatar Feb 13 '18 01:02 dmah42

I can't really reproduce with the quick-bench link provided.

EricWF avatar Feb 13 '18 02:02 EricWF

Great work.

I'm not 100% sure that function alignment is exactly the reason - since I've seen examples where changing from BENCHMARK to BENCHMARK_TEMPLATE causes the bug to occur, which shouldn't affect the code gen (I think).

Here is what I can do: I have about 10 benchmarks, that consecutively compare: previous version/current version. (v1, v2) (v2, v3) ... At one of the comparisons either at -O2 or -O3 the issue would occur. If you can tell me what I should do, I can redo the benchmarks (with new clang and updated google benchmark) and see if the issue would reoccur. If it does not - we can call it a day. If it does than it's not it.

DenisYaroshevskiy avatar Feb 13 '18 12:02 DenisYaroshevskiy

You can't do it in quickbench, but adding -mllvm -align-all-blocks=5 should force everything to be i-cache aligned (as per the blog post) and might eliminate the difference.

dmah42 avatar Feb 15 '18 20:02 dmah42

@dominichamon Thank you, I'll try that.

DenisYaroshevskiy avatar Feb 15 '18 20:02 DenisYaroshevskiy

Hi.

Sorry for an update after almost 2 years - I found a workaround for that project and then never got back to this. Until now.

I was trying to test wether this is code alignment or not.

I found a benchmark where I can reproduce this and then added noops from 1 to 1000 in a function above it https://github.com/DenisYaroshevskiy/algorithm_dumpster/blob/0611d757fe1df13bdfbf23d9d41fd20878899ea9/src/code_alignment.cc#L9

(script that compiles the file) https://github.com/DenisYaroshevskiy/algorithm_dumpster/blob/0611d757fe1df13bdfbf23d9d41fd20878899ea9/scripts/code_alignment_experiment.sh#L1

I did check the objdump => noops are there and the function is indid before (I might be bad at reading objdumps - if there is a trick to that that I do not know).

However - the difference in the results for all 1..1000 noops is negligible.

In the mean time, same benchmark, same file BUT - passing not-passing the unused argument is more then 3 times difference. (with arg it's faster)

https://github.com/DenisYaroshevskiy/algorithm_dumpster/blob/0611d757fe1df13bdfbf23d9d41fd20878899ea9/src/code_alignment.cc#L40

Complete console output:

deniss-air:build denisarosevskij$   clang++ -lbenchmark --std=c++17 -O3 -DUSE_ARG -I ../src ../src/code_alignment.cc ../src/google_benchmark_main.cc
deniss-air:build denisarosevskij$ ./a.out 
2019-08-02 02:59:53
Running ./a.out
Run on (4 X 1800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 3145K (x1)
--------------------------------------------------------
Benchmark                 Time           CPU Iterations
--------------------------------------------------------
test_benchmark/0       1880 ns       1877 ns     370938
deniss-air:build denisarosevskij$   clang++ -lbenchmark --std=c++17 -O3 -I ../src ../src/code_alignment.cc ../src/google_benchmark_main.cc
deniss-air:build denisarosevskij$ ./a.out 
2019-08-02 03:00:05
Running ./a.out
Run on (4 X 1800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 3145K (x1)
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
test_benchmark       7064 ns       7054 ns      98439
deniss-air:build denisarosevskij$ 

Clang version:

 clang --version
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

I'm fairly confident, this is not me making a problem: I have seen the same issue on QuickBench and on a different machine, with libc++ benchmarks, which I believe @EricWF set up.

What do you think?

PS: there is a link to quickbench in the code - depending on wether the do_nothing is uncommented or not, performance jumps by some noticeable number. But I call do_nothing INSIDE the benchmark. This I would believe is code alignment. However - looking through objdumps, clang always seemed to be aligning the beginning of the function to smth

DenisYaroshevskiy avatar Aug 02 '19 02:08 DenisYaroshevskiy

Did a bit more digging.

Now I'm pretty sure that this is code alignment - there is nothing else! I ran this script (console output in the end) uploaded results

If I just look at difference in assembly, produced by clang for the src/code_alignment.cc - there is nothing there.

The only difference comes from addresses. If I look at the first mention of __Z14test_benchmarkRN9benchmark5StateE: - in fast is at address 100028680, in slow at 100028690.

Running diff on objdump proves almost completely useless, even if I feed it all of the options to not spit out numbers, but still - I looked through the cleaner one and didn't find anything else apart from addresses.

The real question though is: is it possible to construct a clear experiment (similar to the one with NOOPs in a function before) to prove that code alignment gave a 3 times difference in timings (which even sounds ridiculous but that's what the numbers are).

And then it would be possible to test out the compiler options for alignment, to see if careful applying of them to the benchmark function would help.

Console output

deniss-air:build denisarosevskij$ ../scripts/generate_google_benchmark_data.sh 
+ rm ../data/google_benchmark_461/asm_diff.diff ../data/google_benchmark_461/fast.s ../data/google_benchmark_461/fast_clean_objdump.txt ../data/google_benchmark_461/fast_objdump.txt ../data/google_benchmark_461/objdump_clean_diff.diff ../data/google_benchmark_461/objdump_diff.diff ../data/google_benchmark_461/slow.s ../data/google_benchmark_461/slow_clean_objdump.txt ../data/google_benchmark_461/slow_objdump.txt
+ clang++ -lbenchmark --std=c++17 -O3 -g -DUSE_ARG -I ../src ../src/code_alignment.cc ../src/google_benchmark_main.cc -o fast.out
+ ./fast.out
2019-08-03 00:43:13
Running ./fast.out
Run on (4 X 1800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 3145K (x1)
--------------------------------------------------------
Benchmark                 Time           CPU Iterations
--------------------------------------------------------
test_benchmark/0       1921 ns       1913 ns     362503
+ clang++ --std=c++17 -O3 -S -DUSE_ARG -I ../src ../src/code_alignment.cc -o ../data/google_benchmark_461/fast.s
+ objdump -disassemble fast.out
+ objdump -disassemble -no-leading-addr -no-show-raw-insn -macho -source fast.out
+ clang++ -lbenchmark --std=c++17 -O3 -g -I ../src ../src/code_alignment.cc ../src/google_benchmark_main.cc -o slow.out
+ ./slow.out
2019-08-03 00:43:17
Running ./slow.out
Run on (4 X 1800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 3145K (x1)
------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
test_benchmark       7338 ns       7307 ns      95409
+ clang++ --std=c++17 -O3 -S -I ../src ../src/code_alignment.cc -o ../data/google_benchmark_461/slow.s
+ objdump -disassemble slow.out
+ objdump -disassemble -no-leading-addr -no-show-raw-insn -macho -source slow.out
+ diff ../data/google_benchmark_461/fast.s ../data/google_benchmark_461/slow.s
+ diff ../data/google_benchmark_461/fast_objdump.txt ../data/google_benchmark_461/slow_objdump.txt
+ diff ../data/google_benchmark_461/fast_clean_objdump.txt ../data/google_benchmark_461/slow_clean_objdump.txt
deniss-air:build denisarosevskij$ 

DenisYaroshevskiy avatar Aug 03 '19 00:08 DenisYaroshevskiy

It's a little bit out of benchmark's scope i'd say. The experimental observation that the [code] alignment has serious effect on performance is actually not that surprising. The question here i suppose is why that happened - did the compiler underestimate the alignment needed, or did it never try to insert the padding in the first place, because wrong -march= was specified.

LebedevRI avatar Aug 04 '19 14:08 LebedevRI

@LebedevRI - I'm not interested in that as much as in I'd like to know 'how to write a benchmark so that it is unaffected by it'. There has to be a way to construct a clean-ish prove that for example -mllvm -align-all-blocks=5 would do the trick.

DenisYaroshevskiy avatar Aug 04 '19 18:08 DenisYaroshevskiy

https://easyperf.net/blog/2018/01/18/Code_alignment_issues is pretty much the canonical source for why that would always (ish) work.

There's also this paper that covers other potential measurement biases caused by compiler options.

If you have a workaround, or you find that -mllvm --align-all-blocks=5 allows you to make non-performance-related changes without significant impact on your results, i'd say stick with it.

We should likely document this somewhere :)

dmah42 avatar Aug 13 '19 13:08 dmah42