
[C++][DONOTMERGE] Use -O2 instead of -O3 for RELEASE builds

Open wesm opened this issue 3 years ago • 36 comments

Motivated by investigation in #13654. To be discussed

wesm avatar Jul 20 '22 20:07 wesm

@ursabot benchmark please

wesm avatar Jul 20 '22 20:07 wesm

Supported benchmark command examples:

@ursabot benchmark help

To run all benchmarks: @ursabot please benchmark

To filter benchmarks by language:
  • @ursabot please benchmark lang=Python
  • @ursabot please benchmark lang=C++
  • @ursabot please benchmark lang=R
  • @ursabot please benchmark lang=Java
  • @ursabot please benchmark lang=JavaScript

To filter Python and R benchmarks by name:
  • @ursabot please benchmark name=file-write
  • @ursabot please benchmark name=file-write lang=Python
  • @ursabot please benchmark name=file-.*

To filter C++ benchmarks by archery --suite-filter and --benchmark-filter:
  • @ursabot please benchmark command=cpp-micro --suite-filter=arrow-compute-vector-selection-benchmark --benchmark-filter=TakeStringRandomIndicesWithNulls/262144/2 --iterations=3

For other command=cpp-micro options, please see https://github.com/ursacomputing/benchmarks/blob/main/benchmarks/cpp_micro_benchmarks.py

ursabot avatar Jul 20 '22 20:07 ursabot

@ursabot please benchmark

wesm avatar Jul 20 '22 20:07 wesm

Benchmark runs are scheduled for baseline = 1214083f7ece4e1797b7f3cdecfec1c2cfa8bf89 and contender = 46e31957885af71eef4adb4af6642ebb246247a7. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
  • [Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ec2-t3-xlarge-us-east-2
  • [Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] test-mac-arm
  • [Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ursa-i9-9960x
  • [Finished :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q

Buildkite builds:
  • [Failed] 46e31957 ec2-t3-xlarge-us-east-2
  • [Failed] 46e31957 test-mac-arm
  • [Failed] 46e31957 ursa-i9-9960x
  • [Finished] 46e31957 ursa-thinkcentre-m75q
  • [Failed] 1214083f ec2-t3-xlarge-us-east-2
  • [Failed] 1214083f test-mac-arm
  • [Failed] 1214083f ursa-i9-9960x
  • [Finished] 1214083f ursa-thinkcentre-m75q

Supported benchmarks:
  • ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
  • test-mac-arm: Supported benchmark langs: C++, Python, R
  • ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
  • ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Jul 20 '22 20:07 ursabot

cc @pitrou

wesm avatar Jul 20 '22 21:07 wesm

Here's a dump of symbols that shrink the most in -O2:

https://gist.github.com/wesm/4a2815077ed37b671d6160b8abec5e7c

I'd be interested to see if e.g. unsafe numeric casts are significantly affected by this

wesm avatar Jul 20 '22 21:07 wesm

Looks great.

I believe the big regressions in some tests are not real. E.g., arrow-bit-util-benchmark : BenchmarkBitmapVisitUInt8And/32768/0 drops from 13.983 GiB/s (O3) to 503.375 MiB/s (O2). Tested on my local host with clang-12, the result is 134 MB/s for both O2 and O3. The huge gap is probably due to aggressive inlining and optimization, which makes the micro-benchmark far from reality.

One catch is that gcc -O2 disables vectorization, while clang -O2 keeps it. We may need additional -fxxxx flags if we want to keep some useful features.

cyb70289 avatar Jul 21 '22 02:07 cyb70289

Hmm, perhaps the bit-util micro-benchmarks are a bit pathological, but other regressions seem real and quite significant...

pitrou avatar Jul 21 '22 07:07 pitrou

Not sure which gcc version is used in conbench. From the gcc-10 man page:

-O3 Optimize yet more.  -O3 turns on all optimizations specified by -O2 and also turns on the following optimization
    flags:

    -fgcse-after-reload -finline-functions -fipa-cp-clone -floop-interchange -floop-unroll-and-jam -fpeel-loops
    -fpredictive-commoning -fsplit-paths -ftree-loop-distribute-patterns -ftree-loop-distribution -ftree-loop-vectorize
    -ftree-partial-pre -ftree-slp-vectorize -funswitch-loops -fvect-cost-model -fversion-loops-for-strides

Perhaps try -O2 -ftree-loop-vectorize? This does affect the arrow-bit-util-benchmark results with gcc, per my test.

cyb70289 avatar Jul 21 '22 07:07 cyb70289

There's -ftree-vectorize, which apparently enables all vectorization (both "tree" loop vectorization and "SLP").

pitrou avatar Jul 21 '22 07:07 pitrou

Also, apparently with gcc 12 the following flags are enabled with -O2: -ftree-loop-vectorize -ftree-slp-vectorize -fvect-cost-model=very-cheap

pitrou avatar Jul 21 '22 07:07 pitrou

@ursabot please benchmark lang=C++

pitrou avatar Jul 21 '22 08:07 pitrou

Benchmark runs are scheduled for baseline = 1214083f7ece4e1797b7f3cdecfec1c2cfa8bf89 and contender = e4e430b4cb511667115f0d2cb9cc4a3c104ae6ae. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
  • [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
  • [Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] test-mac-arm
  • [Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] ursa-i9-9960x
  • [Finished :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q

Buildkite builds:
  • [Finished] e4e430b4 test-mac-arm
  • [Finished] e4e430b4 ursa-thinkcentre-m75q
  • [Failed] 1214083f ec2-t3-xlarge-us-east-2
  • [Failed] 1214083f test-mac-arm
  • [Failed] 1214083f ursa-i9-9960x
  • [Finished] 1214083f ursa-thinkcentre-m75q

Supported benchmarks:
  • ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
  • test-mac-arm: Supported benchmark langs: C++, Python, R
  • ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
  • ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Jul 21 '22 08:07 ursabot

Ok, so I ran locally with gcc 9.4.0 (Ubuntu 20.04) on AMD Zen 2.

Build times (with ccache disabled)

  • -O3:
real	1m25,888s
user	21m58,527s
sys	0m48,888s
  • -O2:
real	1m21,713s
user	20m57,462s
sys	0m48,843s
  • -O2 -ftree-vectorize:
real	1m23,461s
user	21m11,970s
sys	0m49,485s

Lib sizes

  • -O3:
   text	   data	    bss	    dec	    hex	filename
23230116	 326105	2197509	25753730	188f882	build-bundled-release/release/libarrow.so
1276669	  24928	   2346	1303943	 13e587	build-bundled-release/release/libarrow_testing.so
  • -O2:
22574451	 326665	2197509	25098625	17ef981	build-bundled-release/release/libarrow.so
1268546	  25144	   2346	1296036	 13c6a4	build-bundled-release/release/libarrow_testing.so
  • -O2 -ftree-vectorize:
22639315	 326713	2197509	25163537	17ff711	build-bundled-release/release/libarrow.so
1257242	  25144	   2346	1284732	 139a7c	build-bundled-release/release/libarrow_testing.so

Compute benchmarks

  • -O2 vs -O3: https://gist.github.com/pitrou/e76b89f7e7e1f0e9a361410286c4a198
  • -O2 -ftree-vectorize vs -O3: https://gist.github.com/pitrou/feefd0819a513e6068dcd17b12d875e6
  • -O2 -ftree-vectorize vs -O2: https://gist.github.com/pitrou/083d0abd3f404be47716ae31cddda2db

All in all, in this case:

  • the size and build time reductions are real but small (less than 5%)
  • -ftree-vectorize brings a net performance improvement over bare -O2
  • performance seems globally comparable between -O2 -ftree-vectorize and -O3, despite some large disparities in individual micro-benchmarks

pitrou avatar Jul 21 '22 11:07 pitrou

Hmm, I notice that -DNDEBUG was not passed anymore, so perhaps that affected some benchmarks :-(

Edit: re-ran benchmarks and updated the gists above.

pitrou avatar Jul 21 '22 11:07 pitrou

@ursabot please benchmark lang=C++

pitrou avatar Jul 21 '22 11:07 pitrou

Benchmark runs are scheduled for baseline = 1214083f7ece4e1797b7f3cdecfec1c2cfa8bf89 and contender = 6dd7bab55433062c767c7f170cdfb7e5d5a77943. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
  • [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
  • [Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] test-mac-arm
  • [Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] ursa-i9-9960x
  • [Finished :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q

Buildkite builds:
  • [Finished] 6dd7bab5 test-mac-arm
  • [Finished] 6dd7bab5 ursa-thinkcentre-m75q
  • [Failed] 1214083f ec2-t3-xlarge-us-east-2
  • [Failed] 1214083f test-mac-arm
  • [Failed] 1214083f ursa-i9-9960x
  • [Finished] 1214083f ursa-thinkcentre-m75q

Supported benchmarks:
  • ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
  • test-mac-arm: Supported benchmark langs: C++, Python, R
  • ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
  • ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Jul 21 '22 11:07 ursabot

cc @save-buffer @westonpace for further opinions

pitrou avatar Jul 21 '22 13:07 pitrou

I would guess there are places that benefit from loop unswitching also.

wesm avatar Jul 21 '22 13:07 wesm

> I would guess there are places that benefit from loop unswitching also.

Hmm, I don't think it's our duty to micro-optimize compiler options, though. There are too many moving parts (compiler brand, compiler version, architecture, etc.).

pitrou avatar Jul 21 '22 14:07 pitrou

I would say that we should just keep O3 and keep an eye on symbol sizes in case we need to intervene occasionally. On the whole I think the symbol sizes we have are not too bad.

wesm avatar Jul 21 '22 14:07 wesm

@ursabot please benchmark lang=C++

wesm avatar Jul 21 '22 19:07 wesm

Benchmark runs are scheduled for baseline = 8a2acaa40b97e6eafba196812d15a71f49f69d6a and contender = 7e5ca1a5b33e0c5e338e2c5f706b237defe91cb4. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
  • [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
  • [Failed :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] test-mac-arm
  • [Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] ursa-i9-9960x
  • [Finished :arrow_down:0.0% :arrow_up:0.0% :warning: Contender and baseline run contexts do not match] ursa-thinkcentre-m75q

Buildkite builds:
  • [Finished] 7e5ca1a5 test-mac-arm
  • [Finished] 7e5ca1a5 ursa-thinkcentre-m75q
  • [Failed] 8a2acaa4 ec2-t3-xlarge-us-east-2
  • [Failed] 8a2acaa4 test-mac-arm
  • [Failed] 8a2acaa4 ursa-i9-9960x
  • [Finished] 8a2acaa4 ursa-thinkcentre-m75q

Supported benchmarks:
  • ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
  • test-mac-arm: Supported benchmark langs: C++, Python, R
  • ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
  • ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Jul 21 '22 19:07 ursabot

I added -ftree-vectorize for gcc

wesm avatar Jul 21 '22 19:07 wesm

You undid the changes I had already pushed :-(

pitrou avatar Jul 21 '22 19:07 pitrou

I've been trying to get caught up on the context here - I took a look at #13654. My current understanding is:

  • The problem we are trying to solve is the insanely large functions generated by the codegen framework when using -O3
  • The theory is that -O3 applies tons of aggressive optimizations that lead to lots of bloat from too much vectorized code

Does that sound right?

So looking at the results, -O3 adds about 1MB (to ~22MB) to the total binary size, so I think that's not an issue itself. However, there is something to be said about bloating individual kernels. Reading the other PR, it seems like one of the kernels was 40 KB big? That's quite alarming as chips these days have about 32 KB of icache. In the worst case, that's quite a bit of thrashing. That particular disassembly looks to me like the compiler is vectorizing and unrolling the loop after vectorizing it.

As for solutions: Looking at the benchmarks, it seems like the current code is pretty unstable with regards to what the compiler generates when it comes to flags. I'm not sure messing with compiler flags will be one-size-fits-all as each combination of flags causes large changes in the generated code. I did like the changes in #13654.

I really liked this point, which very much aligns with my experience and intuition that abstract templates lead to unstable code generation:

> our approach (so much for "zero cost abstractions") for generalizing to abstract between writing to an array versus packing a bitmap is causing too much code to be generated.

So in my mind, two solutions we could have are:

  • Keep existing code and compilation flags but explicitly disable them for problematic kernels (using something like #pragma GCC push_options and #pragma GCC pop_options, though I'm not sure if there's a way to do this on MSVC).
  • Change the code to use fewer templates and more raw for loops. If we're feeling really adventurous, we could write a Python or Jinja script that generates the kernels as the simplest possible for loop (I know this is the approach used in a lot of databases). I have never seen a problem with this style of code even on -O3.

save-buffer avatar Jul 21 '22 20:07 save-buffer

> You undid the changes I had already pushed :-(

@pitrou, ugh sorry about that, I think I hadn't reloaded the browser. I will restore

wesm avatar Jul 21 '22 20:07 wesm

@save-buffer thanks for your comments

> Change the code to use fewer templates and more raw for loops. If we're feeling really adventurous, we could write a Python or Jinja script that generates the kernels as the simplest possible for loop (I know this is the approach used in a lot of databases). I have never seen a problem with this style of code even on -O3.

I agree also with this -- I know that some feel that manually generating code when you can have templates "do it for you" is an antipattern, but it seems at least that the code in compute/kernels/codegen_internal.h has gone a little too far introducing abstractions where we are putting too much blind faith in the compiler (e.g. the "OutputAdapter").

Not a priority by any means among our myriad priorities but perhaps something for us to occasionally hack at in our idle moments (I did #13654 when I was bored on an airplane)

wesm avatar Jul 21 '22 21:07 wesm

In https://conbench.ursa.dev/compare/runs/e938638743e84794ad829524fae04fbd...20727b1b390e4b30be10f49db7f06f3f/ it seems that there are several hundred microbenchmarks with > 10% performance regressions but also over 100 microbenchmarks with > 10% performance improvement. I'd say it's a coin toss whether to move to -O2 (with -ftree-vectorize) versus -O3.

wesm avatar Jul 21 '22 22:07 wesm

I would very much like to run the TPC-H benchmarks on this change. They are failing in conbench at the moment. There is a fix for these benchmarks in PR right now (#13679) so maybe we can run it after. That will at least give us some sense of the impact at a macro-level.

westonpace avatar Jul 22 '22 17:07 westonpace