[C++][DONOTMERGE] Use -O2 instead of -O3 for RELEASE builds
Motivated by the investigation in #13654. To be discussed.
@ursabot benchmark please
Supported benchmark command examples:
@ursabot benchmark help
To run all benchmarks:
@ursabot please benchmark
To filter benchmarks by language:
@ursabot please benchmark lang=Python
@ursabot please benchmark lang=C++
@ursabot please benchmark lang=R
@ursabot please benchmark lang=Java
@ursabot please benchmark lang=JavaScript
To filter Python and R benchmarks by name:
@ursabot please benchmark name=file-write
@ursabot please benchmark name=file-write lang=Python
@ursabot please benchmark name=file-.*
To filter C++ benchmarks by archery --suite-filter and --benchmark-filter:
@ursabot please benchmark command=cpp-micro --suite-filter=arrow-compute-vector-selection-benchmark --benchmark-filter=TakeStringRandomIndicesWithNulls/262144/2 --iterations=3
For other command=cpp-micro options, please see https://github.com/ursacomputing/benchmarks/blob/main/benchmarks/cpp_micro_benchmarks.py
@ursabot please benchmark
Benchmark runs are scheduled for baseline = 1214083f7ece4e1797b7f3cdecfec1c2cfa8bf89 and contender = 46e31957885af71eef4adb4af6642ebb246247a7. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] test-mac-arm
[Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ursa-i9-9960x
[Finished :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] 46e31957 ec2-t3-xlarge-us-east-2
[Failed] 46e31957 test-mac-arm
[Failed] 46e31957 ursa-i9-9960x
[Finished] 46e31957 ursa-thinkcentre-m75q
[Failed] 1214083f ec2-t3-xlarge-us-east-2
[Failed] 1214083f test-mac-arm
[Failed] 1214083f ursa-i9-9960x
[Finished] 1214083f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
cc @pitrou
Here's a dump of symbols that shrink the most in -O2:
https://gist.github.com/wesm/4a2815077ed37b671d6160b8abec5e7c
I'd be interested to see if e.g. unsafe numeric casts are significantly affected by this
Looks great.
I believe the big regressions in some tests are not real.
E.g., arrow-bit-util-benchmark : BenchmarkBitmapVisitUInt8And/32768/0 drops from 13.983 GiB/s (O3) to 503.375 MiB/s (O2).
Tested on my local host with clang-12, the result is 134 MB/s with both O2 and O3. The huge gap is probably due to aggressive inlining and optimization, which make the micro-benchmark far from reality.
One catch is that gcc -O2 disables vectorization, while clang -O2 keeps it. We may need additional -fxxxx flags if we want to keep some useful features.
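For illustration, here is the kind of loop this difference affects (a minimal sketch; the function name is invented):

```cpp
#include <cstdint>

// A loop of this shape is auto-vectorized by gcc at -O3, or at -O2 with
// -ftree-vectorize added, but stays scalar at plain gcc -O2 (before gcc 12).
// clang already vectorizes it at -O2. Compare the generated assembly with:
//   g++ -O2 -S add.cc
//   g++ -O2 -ftree-vectorize -S add.cc
void AddArrays(const int32_t* a, const int32_t* b, int32_t* out, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}
```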
Hmm, perhaps the bit-util micro-benchmarks are a bit pathological, but other regressions seem real and quite significant...
Not sure which gcc version conbench uses. From the gcc-10 man page:
-O3 Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags:
-fgcse-after-reload -finline-functions -fipa-cp-clone -floop-interchange -floop-unroll-and-jam -fpeel-loops
-fpredictive-commoning -fsplit-paths -ftree-loop-distribute-patterns -ftree-loop-distribution -ftree-loop-vectorize
-ftree-partial-pre -ftree-slp-vectorize -funswitch-loops -fvect-cost-model -fversion-loops-for-strides
Perhaps try -O2 -ftree-loop-vectorize? This does impact the arrow-bit-util-benchmark result with gcc, per my test.
There's -ftree-vectorize which apparently enables all vectorization (both "tree" and "SLP").
Also, apparently with gcc 12 the following flags are enabled with -O2: -ftree-loop-vectorize -ftree-slp-vectorize -fvect-cost-model=very-cheap
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for baseline = 1214083f7ece4e1797b7f3cdecfec1c2cfa8bf89 and contender = e4e430b4cb511667115f0d2cb9cc4a3c104ae6ae. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] test-mac-arm
[Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] ursa-i9-9960x
[Finished :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] e4e430b4 test-mac-arm
[Finished] e4e430b4 ursa-thinkcentre-m75q
[Failed] 1214083f ec2-t3-xlarge-us-east-2
[Failed] 1214083f test-mac-arm
[Failed] 1214083f ursa-i9-9960x
[Finished] 1214083f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
Ok, so I ran this locally with gcc 9.4.0 (Ubuntu 20.04) on AMD Zen 2.
Build times (with ccache disabled)
-O3:
real 1m25,888s
user 21m58,527s
sys 0m48,888s
-O2:
real 1m21,713s
user 20m57,462s
sys 0m48,843s
-O2 -ftree-vectorize:
real 1m23,461s
user 21m11,970s
sys 0m49,485s
Lib sizes
-O3:
text data bss dec hex filename
23230116 326105 2197509 25753730 188f882 build-bundled-release/release/libarrow.so
1276669 24928 2346 1303943 13e587 build-bundled-release/release/libarrow_testing.so
-O2:
22574451 326665 2197509 25098625 17ef981 build-bundled-release/release/libarrow.so
1268546 25144 2346 1296036 13c6a4 build-bundled-release/release/libarrow_testing.so
-O2 -ftree-vectorize:
22639315 326713 2197509 25163537 17ff711 build-bundled-release/release/libarrow.so
1257242 25144 2346 1284732 139a7c build-bundled-release/release/libarrow_testing.so
Compute benchmarks
-O2 vs -O3: https://gist.github.com/pitrou/e76b89f7e7e1f0e9a361410286c4a198
-O2 -ftree-vectorize vs -O3: https://gist.github.com/pitrou/feefd0819a513e6068dcd17b12d875e6
-O2 -ftree-vectorize vs -O2: https://gist.github.com/pitrou/083d0abd3f404be47716ae31cddda2db
All in all, in this case:
- the size and build time reductions are significant but small (less than 5%)
- -ftree-vectorize brings a net performance improvement over bare -O2
- performance seems globally comparable between -O2 -ftree-vectorize and -O3, despite some large disparities in individual micro-benchmarks
Hmm, I notice that -DNDEBUG was not passed anymore, so perhaps that affected some benchmarks :-(
Edit: re-ran benchmarks and updated the gists above.
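For context on the -DNDEBUG remark: assert() only compiles away when NDEBUG is defined, so a Release build missing that define pays for every assertion in hot loops. A minimal sketch (generic C++, not Arrow code):

```cpp
#include <cassert>
#include <cstdint>

int64_t SumNonNegative(const int64_t* values, int64_t n) {
  int64_t total = 0;
  for (int64_t i = 0; i < n; ++i) {
    assert(values[i] >= 0);  // a real runtime check unless NDEBUG is defined
    total += values[i];
  }
  return total;
}
```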
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for baseline = 1214083f7ece4e1797b7f3cdecfec1c2cfa8bf89 and contender = 6dd7bab55433062c767c7f170cdfb7e5d5a77943. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] test-mac-arm
[Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] ursa-i9-9960x
[Finished :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 6dd7bab5 test-mac-arm
[Finished] 6dd7bab5 ursa-thinkcentre-m75q
[Failed] 1214083f ec2-t3-xlarge-us-east-2
[Failed] 1214083f test-mac-arm
[Failed] 1214083f ursa-i9-9960x
[Finished] 1214083f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
cc @save-buffer @westonpace for further opinions
I would guess there are places that benefit from loop unswitching also.
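For illustration, -funswitch-loops (one of the -O3-only flags quoted above) targets loops with a loop-invariant branch inside; a minimal sketch with an invented name:

```cpp
#include <cstdint>

// `negate` never changes inside the loop, so loop unswitching can split this
// into two specialized loops and remove the per-iteration branch entirely.
void CopyMaybeNegated(const double* in, double* out, int64_t n, bool negate) {
  for (int64_t i = 0; i < n; ++i) {
    out[i] = negate ? -in[i] : in[i];
  }
}
```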
> I would guess there are places that benefit from loop unswitching also.
Hmm, I don't think it's our duty to micro-optimize compiler options, though. There are too many moving parts (compiler brand, compiler version, architecture, etc.).
I would say that we should just keep O3 and keep an eye on symbol sizes in case we need to intervene occasionally. On the whole I think the symbol sizes we have are not too bad.
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for baseline = 8a2acaa40b97e6eafba196812d15a71f49f69d6a and contender = 7e5ca1a5b33e0c5e338e2c5f706b237defe91cb4. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] test-mac-arm
[Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] ursa-i9-9960x
[Finished :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7e5ca1a5 test-mac-arm
[Finished] 7e5ca1a5 ursa-thinkcentre-m75q
[Failed] 8a2acaa4 ec2-t3-xlarge-us-east-2
[Failed] 8a2acaa4 test-mac-arm
[Failed] 8a2acaa4 ursa-i9-9960x
[Finished] 8a2acaa4 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
I added -ftree-vectorize for gcc
You undid the changes I had already pushed :-(
I've been trying to get caught up on the context here - I took a look at #13654. My current understanding is:
- The problem we are trying to solve is insanely large functions generated by the codegen framework when using -O3.
- The theory is that -O3 applies tons of aggressive optimizations, leading to lots of bloat from too much vectorized code.

Does that sound right?
So looking at the results, -O3 adds about 1 MB (to ~22 MB) to the total binary size, so I think that's not an issue in itself. However, there is something to be said about bloating individual kernels. Reading the other PR, it seems like one of the kernels was 40 KB? That's quite alarming, as chips these days have about 32 KB of icache; in the worst case, that's quite a bit of thrashing. That particular disassembly looks to me like the compiler is vectorizing the loop and then unrolling the vectorized loop.
As for solutions: Looking at the benchmarks, it seems like the current code is pretty unstable with regards to what the compiler generates when it comes to flags. I'm not sure messing with compiler flags will be one-size-fits-all as each combination of flags causes large changes in the generated code. I did like the changes in #13654.
I really liked this point, which very much aligns with my experience and intuition that abstract templates lead to unstable code generation:
> our approach (so much for "zero cost abstractions") for generalizing to abstract between writing to an array versus packing a bitmap is causing too much code to be generated.
So in my mind, two solutions we could have are:
- Keep the existing code and compilation flags but explicitly disable them for problematic kernels, using something like `#pragma GCC push_options` and `#pragma GCC pop_options` (though I'm not sure if there's a way to do this on MSVC); see the sketch after this list.
- Change the code to use fewer templates and more raw for loops. If we're feeling really adventurous, we could write a Python or Jinja script that generates the kernels as the simplest possible for loop (I know this is the approach used in a lot of databases). I have never seen a problem with this style of code even on -O3.
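A minimal sketch of the pragma approach from the first bullet (the kernel below is invented, not actual Arrow code; as written this is GCC-specific, since Clang only offers an on/off switch via #pragma clang optimize):

```cpp
#include <cstdint>

#pragma GCC push_options
#pragma GCC optimize("O2")  // compile only the functions below at -O2

void CopyValues(const int32_t* in, int32_t* out, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    out[i] = in[i];
  }
}

#pragma GCC pop_options  // restore the command-line flags (e.g. -O3)
```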
> You undid the changes I had already pushed :-(
@pitrou, ugh sorry about that, I think I hadn't reloaded the browser. I will restore
@save-buffer thanks for your comments
> Change the code to use fewer templates and more raw for loops. If we're feeling really adventurous, we could write a Python or Jinja script that generates the kernels as the simplest possible for loop (I know this is the approach used in a lot of databases). I have never seen a problem with this style of code even on -O3.
I also agree with this -- I know some feel that manually generating code when templates can "do it for you" is an antipattern, but it seems the code in compute/kernels/codegen_internal.h has gone a little too far in introducing abstractions that put too much blind faith in the compiler (e.g. the "OutputAdapter").
Not a priority by any means among our myriad priorities but perhaps something for us to occasionally hack at in our idle moments (I did #13654 when I was bored on an airplane)
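As a rough illustration of the "simplest possible for loop" style (all names invented, not actual Arrow code), a generator script would emit one flat, monomorphic function per type instead of a deep template stack:

```cpp
#include <cstdint>

// Hypothetical generated kernels: trivial for the compiler to optimize,
// and their code size is easy to predict.
void AddInt32(const int32_t* left, const int32_t* right, int32_t* out,
              int64_t n) {
  for (int64_t i = 0; i < n; ++i) out[i] = left[i] + right[i];
}

void AddDouble(const double* left, const double* right, double* out,
               int64_t n) {
  for (int64_t i = 0; i < n; ++i) out[i] = left[i] + right[i];
}
```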
In https://conbench.ursa.dev/compare/runs/e938638743e84794ad829524fae04fbd...20727b1b390e4b30be10f49db7f06f3f/ it seems that there are several hundred microbenchmarks with > 10% performance regressions but also over 100 microbenchmarks with > 10% performance improvements. I'd say it's a coin toss whether to move to -O2 (with -ftree-vectorize) versus -O3.
I would very much like to run the TPC-H benchmarks on this change. They are failing in conbench at the moment. There is a fix for these benchmarks in PR right now (#13679) so maybe we can run it after. That will at least give us some sense of the impact at a macro-level.