gh-90536: Add support for the BOLT post-link binary optimizer

Open kmod opened this issue 2 years ago • 14 comments

Using bolt provides a fairly large speedup without any code or functionality changes. It provides roughly a 1% speedup on pyperformance, and a 4% improvement on the Pyston web macrobenchmarks.

It is gated behind an --enable-bolt configure arg because not all toolchains and environments are supported. It has been tested on a Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6 sources (their binary distribution of this version did not include bolt).

Compared to a previous attempt, this commit uses bolt's preferred "instrumentation" approach and adds some non-PIE flags that enable much better optimizations from bolt.

The effects of this change are a bit more dependent on CPU microarchitecture than other changes, since it optimizes i-cache behavior which seems to be a bit more variable between architectures. The 1%/4% numbers were collected on an Intel Skylake CPU, and on an AMD Zen 3 CPU I got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance I got a slightly lower speedup (1%/3%).

The low speedup on pyperformance is not entirely unexpected, because BOLT improves i-cache behavior, and the benchmarks in the pyperformance suite are small and tend to fit in i-cache.

This change uses the existing pgo profiling task (python -m test --pgo), though I was able to measure about a 1% macrobenchmark improvement by using the macrobenchmarks as the training task. I personally think that both the PGO and BOLT tasks should be updated to use macrobenchmarks, but for the sake of splitting up the work this PR uses the existing pgo task.

  • Issue: gh-90536

kmod avatar Aug 11 '22 22:08 kmod

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

bedevere-bot avatar Aug 11 '22 22:08 bedevere-bot

Thanks! I hope @corona10 can review and merge this, and maybe @pablogsal will be willing to backport it to 3.11.

gvanrossum avatar Aug 11 '22 23:08 gvanrossum

and maybe @pablogsal will be willing to backport it to 3.11.

Unfortunately, changes in the configure script or makefile are too much at this stage, especially for a new feature that has not been tested in the wild (by users checking the pre-releases). Sadly, this must go to 3.12.

pablogsal avatar Aug 11 '22 23:08 pablogsal

Nice work! I will take a look at this PR by this weekend

corona10 avatar Aug 11 '22 23:08 corona10

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

bedevere-bot avatar Aug 12 '22 16:08 bedevere-bot

@kmod

I am verifying the patch on the c6i.xlarge EC2 instance. Would you like to provide the compiler version you used?

corona10 avatar Aug 13 '22 03:08 corona10

Hmm, I will try to build BOLT from LLVM 14.0.6

corona10 avatar Aug 13 '22 11:08 corona10

I found out why BOLT was failing; I will downgrade GCC to version 10.


DWARF 5 has become the default in GCC 11

corona10 avatar Aug 13 '22 12:08 corona10

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

bedevere-bot avatar Aug 13 '22 14:08 bedevere-bot

@gvanrossum @kmod cc @markshannon

Interesting result! The following benchmark was measured on AWS c5n.metal / gcc-10 (base commit: f235178beccf5eb5b47e770240f32d9ba24b26fd). I would also like to re-measure the benchmark on the Faster CPython project machine. I am going to measure the L1 i-cache miss ratio soon, on a machine where the perf tool is available.

| Benchmark | CPython 3.12 ./configure --enable-optimizations --with-lto | CPython 3.12 ./configure --enable-optimizations --with-lto --enable-bolt |
|---|---|---|
| 2to3 | 269 ms | 255 ms: 1.05x faster |
| chameleon | 7.39 ms | 7.02 ms: 1.05x faster |
| chaos | 74.1 ms | 68.8 ms: 1.08x faster |
| crypto_pyaes | 82.3 ms | 77.2 ms: 1.07x faster |
| deltablue | 3.65 ms | 3.41 ms: 1.07x faster |
| django_template | 38.6 ms | 35.3 ms: 1.09x faster |
| dulwich_log | 67.6 ms | 58.7 ms: 1.15x faster |
| fannkuch | 385 ms | 380 ms: 1.02x faster |
| float | 73.2 ms | 72.4 ms: 1.01x faster |
| genshi_text | 24.3 ms | 23.3 ms: 1.04x faster |
| genshi_xml | 56.4 ms | 52.8 ms: 1.07x faster |
| go | 140 ms | 136 ms: 1.03x faster |
| hexiom | 6.40 ms | 6.25 ms: 1.02x faster |
| html5lib | 65.0 ms | 60.7 ms: 1.07x faster |
| json_dumps | 11.1 ms | 10.4 ms: 1.07x faster |
| json_loads | 28.7 us | 26.3 us: 1.09x faster |
| logging_format | 7.29 us | 6.69 us: 1.09x faster |
| logging_silent | 101 ns | 97.6 ns: 1.03x faster |
| logging_simple | 6.48 us | 6.01 us: 1.08x faster |
| mako | 10.6 ms | 9.91 ms: 1.07x faster |
| meteor_contest | 106 ms | 102 ms: 1.04x faster |
| nbody | 86.4 ms | 87.7 ms: 1.02x slower |
| nqueens | 91.3 ms | 88.1 ms: 1.04x faster |
| pathlib | 19.0 ms | 16.8 ms: 1.13x faster |
| pickle_dict | 32.2 us | 32.6 us: 1.01x slower |
| pickle_list | 4.69 us | 4.62 us: 1.02x faster |
| pickle_pure_python | 297 us | 282 us: 1.05x faster |
| pidigits | 177 ms | 176 ms: 1.01x faster |
| pyflate | 423 ms | 416 ms: 1.02x faster |
| python_startup | 8.72 ms | 8.15 ms: 1.07x faster |
| python_startup_no_site | 6.35 ms | 5.97 ms: 1.06x faster |
| raytrace | 312 ms | 293 ms: 1.06x faster |
| regex_compile | 139 ms | 131 ms: 1.06x faster |
| regex_dna | 180 ms | 185 ms: 1.03x slower |
| regex_effbot | 2.99 ms | 2.82 ms: 1.06x faster |
| regex_v8 | 21.4 ms | 20.4 ms: 1.05x faster |
| richards | 48.6 ms | 46.3 ms: 1.05x faster |
| scimark_fft | 348 ms | 338 ms: 1.03x faster |
| scimark_lu | 120 ms | 117 ms: 1.02x faster |
| scimark_monte_carlo | 67.0 ms | 65.4 ms: 1.02x faster |
| scimark_sor | 116 ms | 113 ms: 1.02x faster |
| spectral_norm | 101 ms | 102 ms: 1.01x slower |
| sqlalchemy_declarative | 143 ms | 135 ms: 1.06x faster |
| sqlalchemy_imperative | 19.0 ms | 17.0 ms: 1.12x faster |
| sqlite_synth | 2.50 us | 2.29 us: 1.09x faster |
| sympy_expand | 507 ms | 465 ms: 1.09x faster |
| sympy_integrate | 21.7 ms | 20.5 ms: 1.06x faster |
| sympy_sum | 176 ms | 164 ms: 1.08x faster |
| sympy_str | 311 ms | 286 ms: 1.09x faster |
| telco | 7.02 ms | 6.36 ms: 1.10x faster |
| tornado_http | 125 ms | 113 ms: 1.10x faster |
| unpickle | 15.7 us | 15.1 us: 1.04x faster |
| unpickle_list | 4.74 us | 4.56 us: 1.04x faster |
| unpickle_pure_python | 229 us | 219 us: 1.05x faster |
| xml_etree_parse | 158 ms | 155 ms: 1.02x faster |
| xml_etree_iterparse | 103 ms | 101 ms: 1.02x faster |
| xml_etree_generate | 91.0 ms | 84.3 ms: 1.08x faster |
| xml_etree_process | 61.9 ms | 58.4 ms: 1.06x faster |
| Geometric mean | (ref) | 1.05x faster |

Benchmark hidden because not significant (3): pickle, scimark_sparse_mat_mult, unpack_sequence
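
For readers who want to see how a per-benchmark comparison like this rolls up into a single "geometric mean" figure, here is a minimal sketch. It assumes the summary is the geometric mean of the reference/changed timing ratios, and it uses only five rows copied from the table above, so it is an illustration rather than a reproduction of the reported number. The names `timings`, `speedups`, and `overall` are made up for this example.

```python
from statistics import geometric_mean

# (reference, with BOLT) timings copied from a few rows of the table above.
# Each pair uses the same unit, so the ratio is unitless.
timings = {
    "2to3": (269.0, 255.0),        # ms
    "dulwich_log": (67.6, 58.7),   # ms
    "nbody": (86.4, 87.7),         # ms (slower with BOLT)
    "regex_dna": (180.0, 185.0),   # ms (slower with BOLT)
    "telco": (7.02, 6.36),         # ms
}

# Speedup > 1.0 means the BOLT build was faster on that benchmark.
speedups = {name: ref / bolt for name, (ref, bolt) in timings.items()}

# The geometric mean treats a 1.10x speedup and a 1.10x slowdown as cancelling out,
# which an arithmetic mean of ratios would not do.
overall = geometric_mean(speedups.values())

for name, ratio in sorted(speedups.items(), key=lambda item: item[1]):
    print(f"{name:12s} {ratio:.2f}x")
print(f"geometric mean over this subset: {overall:.2f}x")
```

Because only a subset of rows is included, the result is not expected to match the full-suite figure exactly.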

corona10 avatar Aug 13 '22 16:08 corona10

Another benchmark from an Azure VM (Ubuntu 20.04.4 LTS, gcc 9.4.0): https://gist.github.com/corona10/c2aa0108a5ffcc96be449c0ce033412d

But let's measure the benchmark on the Faster CPython machine after the PR is merged.

corona10 avatar Aug 14 '22 08:08 corona10

I succeeded in collecting cache-miss metrics, and I also got a pyperformance result that is similar to my previous attempts and Kevin's report. I didn't analyze whether the GCC version or OS version affects the performance result, but I can conclude that BOLT definitely makes CPython faster.

Environment

  • Hardware: AWS c5n.metal
  • Red Hat Enterprise Linux release 8.6 (Ootpa)
  • gcc: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10)
  • LLVM version 14.0.6

Binary Size

  • Without BOLT: 79M
  • With BOLT: 36M

ICache miss

| Experiment | instructions | L1-icache-misses | ratio |
|---|---|---|---|
| PGO + LTO | 8,330,863,079,932 | 77,047,357,163 | 0.92% |
| PGO + LTO + BOLT | 8,312,698,165,975 | 65,319,225,064 | 0.79% |
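
As a quick sanity check on these counters, the sketch below recomputes the ratio column as L1-icache-misses divided by instructions, which is consistent with the percentages shown, and also derives the relative drop in misses. The helper name `miss_ratio` and the variable names are made up for this example; the counter values are copied from the table.

```python
def miss_ratio(icache_misses: int, instructions: int) -> float:
    """Return L1 i-cache misses per retired instruction, as a fraction."""
    return icache_misses / instructions

# Counter values copied from the table above.
pgo_lto_misses, pgo_lto_instructions = 77_047_357_163, 8_330_863_079_932
bolt_misses, bolt_instructions = 65_319_225_064, 8_312_698_165_975

pgo_lto_ratio = miss_ratio(pgo_lto_misses, pgo_lto_instructions)
bolt_ratio = miss_ratio(bolt_misses, bolt_instructions)

# Relative reduction in the absolute number of L1 i-cache misses.
miss_reduction = 1 - bolt_misses / pgo_lto_misses

print(f"PGO + LTO:        {pgo_lto_ratio:.2%}")
print(f"PGO + LTO + BOLT: {bolt_ratio:.2%}")
print(f"fewer L1 i-cache misses with BOLT: {miss_reduction:.1%}")
```

The instruction counts of the two runs differ only slightly, so most of the change in the ratio comes from the lower miss count.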

Benchmark (1.01x faster)

https://gist.github.com/corona10/5726d1528176677d4c694265edfc4bf5

corona10 avatar Aug 15 '22 04:08 corona10

Thanks for taking a look! Yes, many of the Pyston macrobenchmarks broke in 3.11, but it looks like @mdboom is currently working on updating the dependencies to versions that are compatible with 3.11.

I have made the requested changes; please review again

kmod avatar Aug 16 '22 21:08 kmod

Thanks for making the requested changes!

@corona10: please review the changes made to this pull request.

bedevere-bot avatar Aug 16 '22 21:08 bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

And if you don't make the requested changes, you will be put in the comfy chair!

bedevere-bot avatar Aug 16 '22 23:08 bedevere-bot

Hi,

One thing to note is that a Python program may spend considerable time in the shared libraries corresponding to modules, and BOLT is able to optimize them as well. I would suggest profiling the benchmarks with perf and optimizing the .so's too.

aaupov avatar Aug 18 '22 22:08 aaupov

@aaupov I would recommend opening an issue for your suggestions at https://github.com/faster-cpython/ideas or https://github.com/python/cpython/issues. I think the faster-cpython repo is the more appropriate place :)

corona10 avatar Aug 18 '22 22:08 corona10

Another question, and an important aspect of performance tuning.

GCC PGO and clang PGO are different: the GCC PGO profiler (profile-generate) can collect deeper data for PGO than clang's profile-generate.

So it would be nice to add new flags:

  • --enable-lto-gcc --enable-pgo-gcc, taking into account the reorder flag needed at the GCC level for BOLT (on the clang side), plus bolting

  • one toolchain entirely in clang: --enable-lto-llvm --enable-pgo-llvm, plus bolt

Thank you very much

osevan avatar Dec 18 '22 23:12 osevan