Different results, depending on whether certain benchmarks are enabled.
Hi!
If I enable set_union_unbalanced benchmark, I get:
set_union_benchmark<std_set_union>          9301 ns    9238 ns    74616
set_union_benchmark<set_union_unbalanced>   9586 ns    9550 ns    76047
However, if it's disabled (commented out), I get:
set_union_benchmark<std_set_union> 3376 ns 3364 ns 207403
Results are consistent and depend a lot on the partition_point_biased code. Seems like a bug. Code: https://gist.github.com/DenisYaroshevskiy/604f1e331b384579b2a4fb5eac3ac8ef
System info:
clang --version
Apple LLVM version 9.0.0 (clang-900.0.38)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Upd: Seems like the issue goes away if I use BENCHMARK instead of BENCHMARK_TEMPLATE. Is this even possible?
#define BENCHMARK(n) \
BENCHMARK_PRIVATE_DECLARE(n) = \
(::benchmark::internal::RegisterBenchmarkInternal( \
new ::benchmark::internal::FunctionBenchmark(#n, n)))
#define BENCHMARK_TEMPLATE1(n, a) \
BENCHMARK_PRIVATE_DECLARE(n) = \
(::benchmark::internal::RegisterBenchmarkInternal( \
new ::benchmark::internal::FunctionBenchmark(#n "<" #a ">", n<a>)))
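For context, a minimal sketch of the switch on my side (the benchmark body and the std_set_union wrapper are stand-ins for the code in the gist, not the exact code):

#include <benchmark/benchmark.h>
#include <algorithm>
#include <vector>

// Stand-in algorithm wrapper; the gist has several of these.
struct std_set_union {
  template <typename I, typename O>
  O operator()(I f1, I l1, I f2, I l2, O o) const {
    return std::set_union(f1, l1, f2, l2, o);
  }
};

// Stand-in for the templated benchmark from the gist.
template <typename Alg>
void set_union_benchmark(benchmark::State& state) {
  std::vector<int> a(100, 1), b(100, 2), out(200);
  while (state.KeepRunning()) {
    benchmark::DoNotOptimize(
        Alg{}(a.begin(), a.end(), b.begin(), b.end(), out.begin()));
  }
}

// Registration via the template macro (the variant where the issue shows up):
BENCHMARK_TEMPLATE(set_union_benchmark, std_set_union);

// The workaround: wrap the instantiation in a plain function
// and register that with BENCHMARK instead.
void set_union_benchmark_std(benchmark::State& state) {
  set_union_benchmark<std_set_union>(state);
}
BENCHMARK(set_union_benchmark_std);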
I don't think that replacing one with the other would make any difference.
Can you put the files into a gist so it's easier to take a look at them?
That makes sense. I haven't reproduced it since, though.
Here is the gist (I didn't know about gists; I don't use GitHub that often): https://gist.github.com/DenisYaroshevskiy/c1c28e3fd95f8a7eb53ddab40dc8464f
$ clang++ ./set_union_bench.cc -o set_union_bench -I ~/git/benchmark/include/ -L ~/git/benchmark/_build/src/ -lbenchmark -pthread --std=c++11
In file included from ./set_union_bench.cc:7:
./lib.h:209:10: error: chosen constructor is explicit in copy-initialization
return {f1, f2, o};
Am I missing some requirement?
This is my compilation line:
clang++ --std=c++14 -O3 -Werror -Wall $1 -I /space/google_benchmarks/include/ /space/google_benchmarks/build/src/libbenchmark.a set_union_bench.cc
Reproduced the bug again, no problem.
I use https://github.com/google/benchmark/releases/tag/v1.2.0, but previously it manifested itself on master too.
I can answer faster in Telegram, if it helps: @dyaroshev
And the problem still goes away if you use plain functions instead.
You're using libc++ as well right?
I'm still trying to get your example to compile :)
Yes, libc++, out of the box, no hand-waving! I even have -Wall on! Do you use C++14?
Ok, seems like the older version of clang is the problem - I reproduced your compilation error on quick-bench.
Here you go, a fixed gist: https://gist.github.com/DenisYaroshevskiy/604f1e331b384579b2a4fb5eac3ac8ef
I even made it single file
The problem is still here.
Hi!
I found a repro that kind of works on quick-bench too. Here is the code: http://quick-bench.com/pFHnrUKpCRJMgTPBH6pcZImXYIw Enabling/disabling CurrentSetUnion makes LinearSetUnion slower by a millisecond.
On my machine, things are much worse:
----------------------------------------------------------------
Benchmark                       Time           CPU     Iterations
----------------------------------------------------------------
baseline/560/1440             152 ns        152 ns        4313718
LinearSetUnion/560/1440      4705 ns       4703 ns         149426
CurrentSetUnion/560/1440     3130 ns       3130 ns         222490

And with CurrentSetUnion disabled:

----------------------------------------------------------------
Benchmark                       Time           CPU     Iterations
----------------------------------------------------------------
baseline/560/1440             153 ns        153 ns        4512810
LinearSetUnion/560/1440      1909 ns       1909 ns         370949
I think I may have found the reason, but I haven't had a chance to test it yet. Instead of waiting longer for that chance, I figured I could post my thoughts and maybe someone else can test it.
https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues
reminded me about issues I'd seen before with aligning functions on i-cache line boundaries. I wonder if this is what's happening. Further, I wonder if we should recommend somewhere forcing function-level i-cache line alignment when compiling benchmarks.
I can't really reproduce with the quick-bench link provided.
Great work.
I'm not 100% sure that function alignment is exactly the reason - since I've seen examples where changing from BENCHMARK to BENCHMARK_TEMPLATE causes the bug to occur, which shouldn't affect the code gen (I think).
Here is what I can do: I have about 10 benchmarks that consecutively compare previous and current versions: (v1, v2), (v2, v3), ... At one of the comparisons, either at -O2 or -O3, the issue occurs. If you tell me what I should do, I can redo the benchmarks (with a new clang and an updated google benchmark) and see if the issue reoccurs. If it does not, we can call it a day. If it does, then it's not it.
You can't do it in quickbench, but adding -mllvm -align-all-blocks=5
should force everything to be i-cache aligned (as per the blog post) and might eliminate the difference.
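For example, a hypothetical adaptation of the compile line from earlier in the thread (per the blog post, the value is log2 of the alignment, so 5 means 32-byte boundaries):
clang++ --std=c++14 -O3 -Wall -mllvm -align-all-blocks=5 -lbenchmark -pthread set_union_bench.cc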
@dominichamon Thank you, I'll try that.
Hi.
Sorry for an update after almost 2 years - I found a workaround for that project and then never got back to this. Until now.
I was trying to test whether this is code alignment or not.
I found a benchmark where I can reproduce this, and then added from 1 to 1000 nops in a function above it: https://github.com/DenisYaroshevskiy/algorithm_dumpster/blob/0611d757fe1df13bdfbf23d9d41fd20878899ea9/src/code_alignment.cc#L9
(script that compiles the file) https://github.com/DenisYaroshevskiy/algorithm_dumpster/blob/0611d757fe1df13bdfbf23d9d41fd20878899ea9/scripts/code_alignment_experiment.sh#L1
I did check the objdump: the nops are there and the function is indeed before the benchmark (I might be bad at reading objdumps; if there is a trick to it that I don't know, tell me).
However, the difference in the results across all 1..1000 nops is negligible.
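For reference, a minimal sketch of the padding experiment, with the nop count hardcoded for illustration (the real file varies it from 1 to 1000 via the script; names are stand-ins):

#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

#define NOP1  asm volatile("nop");
#define NOP8  NOP1 NOP1 NOP1 NOP1 NOP1 NOP1 NOP1 NOP1
#define NOP64 NOP8 NOP8 NOP8 NOP8 NOP8 NOP8 NOP8 NOP8

// Emitted before the benchmark (clang keeps source order here), so
// changing the nop count shifts the address of everything that follows.
__attribute__((noinline)) void padding() { NOP64 }

void test_benchmark(benchmark::State& state) {
  std::vector<int> v(1000, 1);
  while (state.KeepRunning()) {
    benchmark::DoNotOptimize(std::accumulate(v.begin(), v.end(), 0));
  }
}
BENCHMARK(test_benchmark);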
Meanwhile, for the same benchmark in the same file, passing vs. not passing an unused argument makes more than a 3x difference (with the argument it's faster).
https://github.com/DenisYaroshevskiy/algorithm_dumpster/blob/0611d757fe1df13bdfbf23d9d41fd20878899ea9/src/code_alignment.cc#L40
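Judging by the test_benchmark/0 vs. test_benchmark names in the output below, the USE_ARG switch presumably amounts to something like this (a guess at the registration, not the exact code):

#ifdef USE_ARG
BENCHMARK(test_benchmark)->Arg(0);  // registers "test_benchmark/0" (the fast run)
#else
BENCHMARK(test_benchmark);          // registers "test_benchmark" (the slow run)
#endif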
Complete console output:
deniss-air:build denisarosevskij$ clang++ -lbenchmark --std=c++17 -O3 -DUSE_ARG -I ../src ../src/code_alignment.cc ../src/google_benchmark_main.cc
deniss-air:build denisarosevskij$ ./a.out
2019-08-02 02:59:53
Running ./a.out
Run on (4 X 1800 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 3145K (x1)
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
test_benchmark/0 1880 ns 1877 ns 370938
deniss-air:build denisarosevskij$ clang++ -lbenchmark --std=c++17 -O3 -I ../src ../src/code_alignment.cc ../src/google_benchmark_main.cc
deniss-air:build denisarosevskij$ ./a.out
2019-08-02 03:00:05
Running ./a.out
Run on (4 X 1800 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 3145K (x1)
------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------
test_benchmark 7064 ns 7054 ns 98439
deniss-air:build denisarosevskij$
Clang version:
clang --version
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
I'm fairly confident this is not a problem of my own making: I have seen the same issue on QuickBench and on a different machine with the libc++ benchmarks, which I believe @EricWF set up.
What do you think?
PS: there is a link to quick-bench in the code; depending on whether do_nothing is uncommented or not, performance jumps by a noticeable amount. But I call do_nothing INSIDE the benchmark, so that one I would believe is code alignment. However, looking through objdumps, clang always seemed to align the beginning of the function to something.
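For illustration, the shape of that quick-bench experiment as I understand it (do_nothing is a guess at the helper: an empty, non-inlinable call inside the timed loop):

#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

__attribute__((noinline)) void do_nothing() { asm volatile(""); }

void test_benchmark(benchmark::State& state) {
  std::vector<int> v(1000, 1);
  while (state.KeepRunning()) {
    // do_nothing();  // (un)commenting this call shifts the timings noticeably
    benchmark::DoNotOptimize(std::accumulate(v.begin(), v.end(), 0));
  }
}
BENCHMARK(test_benchmark);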
Did a bit more digging.
Now I'm pretty sure that this is code alignment; there is nothing else! I ran this script (console output at the end) and uploaded the results.
If I just look at the difference in the assembly produced by clang for src/code_alignment.cc, there is nothing there. The only difference comes from addresses.
If I look at the first mention of __Z14test_benchmarkRN9benchmark5StateE: in the fast build it is at address 100028680, in the slow one at 100028690.
Running diff on the objdumps proves almost completely useless, even if I feed it all the options to not spit out addresses; still, I looked through the cleaner one and didn't find anything else apart from the addresses.
The real question, though, is whether it is possible to construct a clean experiment (similar to the one with the nops in a function before) to prove that code alignment gave a 3x difference in timings (which even sounds ridiculous, but that's what the numbers say).
And then it would be possible to test out the compiler options for alignment, to see if carefully applying them to the benchmark function would help.
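One hypothetical way to tie the timings to alignment directly, instead of diffing objdumps (report_alignment is my own helper here, not part of google benchmark):

#include <cstdint>
#include <cstdio>

// Print where a function landed relative to 64-byte cache lines;
// compare the offset between the fast and the slow build.
template <typename F>
void report_alignment(const char* name, F* fn) {
  auto addr = reinterpret_cast<std::uintptr_t>(fn);
  std::printf("%s at %p (64-byte line offset: %u)\n", name,
              reinterpret_cast<void*>(fn), unsigned(addr % 64));
}

// Usage, e.g. at the top of main():
//   report_alignment("test_benchmark", &test_benchmark);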
Console output
deniss-air:build denisarosevskij$ ../scripts/generate_google_benchmark_data.sh
+ rm ../data/google_benchmark_461/asm_diff.diff ../data/google_benchmark_461/fast.s ../data/google_benchmark_461/fast_clean_objdump.txt ../data/google_benchmark_461/fast_objdump.txt ../data/google_benchmark_461/objdump_clean_diff.diff ../data/google_benchmark_461/objdump_diff.diff ../data/google_benchmark_461/slow.s ../data/google_benchmark_461/slow_clean_objdump.txt ../data/google_benchmark_461/slow_objdump.txt
+ clang++ -lbenchmark --std=c++17 -O3 -g -DUSE_ARG -I ../src ../src/code_alignment.cc ../src/google_benchmark_main.cc -o fast.out
+ ./fast.out
2019-08-03 00:43:13
Running ./fast.out
Run on (4 X 1800 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 3145K (x1)
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
test_benchmark/0 1921 ns 1913 ns 362503
+ clang++ --std=c++17 -O3 -S -DUSE_ARG -I ../src ../src/code_alignment.cc -o ../data/google_benchmark_461/fast.s
+ objdump -disassemble fast.out
+ objdump -disassemble -no-leading-addr -no-show-raw-insn -macho -source fast.out
+ clang++ -lbenchmark --std=c++17 -O3 -g -I ../src ../src/code_alignment.cc ../src/google_benchmark_main.cc -o slow.out
+ ./slow.out
2019-08-03 00:43:17
Running ./slow.out
Run on (4 X 1800 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 3145K (x1)
------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------
test_benchmark 7338 ns 7307 ns 95409
+ clang++ --std=c++17 -O3 -S -I ../src ../src/code_alignment.cc -o ../data/google_benchmark_461/slow.s
+ objdump -disassemble slow.out
+ objdump -disassemble -no-leading-addr -no-show-raw-insn -macho -source slow.out
+ diff ../data/google_benchmark_461/fast.s ../data/google_benchmark_461/slow.s
+ diff ../data/google_benchmark_461/fast_objdump.txt ../data/google_benchmark_461/slow_objdump.txt
+ diff ../data/google_benchmark_461/fast_clean_objdump.txt ../data/google_benchmark_461/slow_clean_objdump.txt
deniss-air:build denisarosevskij$
It's a little bit out of benchmark's scope, I'd say. The experimental observation that code alignment has a serious effect on performance is actually not that surprising. The question here, I suppose, is why that happened: did the compiler underestimate the alignment needed, or did it never try to insert the padding in the first place because the wrong -march= was specified?
@LebedevRI: I'm not interested in that so much as I'd like to know how to write a benchmark so that it is unaffected by this. There has to be a way to construct a clean-ish proof that, for example, -mllvm -align-all-blocks=5 would do the trick.
https://easyperf.net/blog/2018/01/18/Code_alignment_issues is pretty much the canonical source for why that would always (ish) work.
There's also this paper that covers other potential measurement biases caused by compiler options.
If you have a workaround, or you find that -mllvm -align-all-blocks=5 allows you to make non-performance-related changes without significant impact on your results, I'd say stick with it.
We should likely document this somewhere :)