gh-90536: Add support for the BOLT post-link binary optimizer
Using BOLT provides a fairly large speedup without any code or functionality changes: roughly a 1% speedup on pyperformance, and a 4% improvement on the Pyston web macrobenchmarks.
It is gated behind an --enable-bolt configure argument because not all toolchains and environments are supported. It has been tested on a Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6 sources (the binary distribution of that version did not include BOLT).
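For concreteness, enabling it looks roughly like this (a sketch; it assumes llvm-bolt and merge-fdata from an LLVM 14.x build are on $PATH):

```sh
# Configure with PGO, LTO, and the new BOLT step, then build.
./configure --enable-optimizations --with-lto --enable-bolt
make -j"$(nproc)"
```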
Compared to a previous attempt, this commit uses BOLT's preferred "instrumentation" approach, and adds some non-PIE flags which enable much better optimizations from BOLT.
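For context, the instrumentation approach is roughly a three-pass flow like the sketch below; this is a hand-written outline, not the literal Makefile rules, and the optimization flags shown are assumptions:

```sh
# 1. Insert profiling counters into the freshly linked interpreter.
llvm-bolt ./python -instrument -o ./python.instrumented

# 2. Run a training workload; the instrumented binary writes .fdata profiles
#    (to /tmp/prof.fdata by default).
./python.instrumented -m test --pgo
merge-fdata /tmp/prof.fdata* > prof.fdata

# 3. Rewrite the original binary using the collected profile.
llvm-bolt ./python -o ./python.bolt -data=prof.fdata \
    -reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3
```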
The effects of this change are a bit more dependent on CPU microarchitecture than other changes, since it optimizes i-cache behavior, which seems to vary more between architectures. The 1%/4% numbers were collected on an Intel Skylake CPU; on an AMD Zen 3 CPU I got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance a slightly lower one (1%/3%).
The low speedup on pyperformance is not entirely unexpected, because BOLT improves i-cache behavior, and the benchmarks in the pyperformance suite are small and tend to fit in i-cache.
This change uses the existing PGO profiling task (python -m test --pgo), though I was able to measure about a 1% macrobenchmark improvement by using the macrobenchmarks as the training task. I personally think that both the PGO and BOLT tasks should be updated to use macrobenchmarks, but for the sake of splitting up the work this PR uses the existing PGO task.
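If someone wants to experiment with a different training task, CPython's build already exposes the PGO task as the PROFILE_TASK make variable; whether the BOLT step picks it up is an assumption here, and the pyperformance invocation is only illustrative:

```sh
# Swap the default "-m test --pgo" training workload for macrobenchmarks.
make PROFILE_TASK='-m pyperformance run --fast' -j"$(nproc)"
```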
- Issue: gh-90536
Most changes to Python require a NEWS entry.
Please add it using the blurb_it web app or the blurb command-line tool.
Thanks! I hope @corona10 can review and merge this, and maybe @pablogsal will be willing to backport it to 3.11.
Unfortunately, changes in the configure script or makefile are too much at this stage, especially for a new feature that has not been tested in the wild (by users checking the pre-releases). Sadly, this must go to 3.12.
Nice work! I will take a look at this PR by this weekend
@kmod
I am verifying the patch on the c6i.xlarge EC2 instance. Could you provide the compiler version you used?
Hmm, I will try to build BOLT from LLVM 14.0.6
I found out why BOLT failed; I will downgrade the GCC version to 10. DWARF 5 has become the default in GCC 11.
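If the failure is indeed LLVM 14 BOLT's lack of DWARF 5 support, an untested alternative to downgrading GCC would be forcing DWARF 4 debug info; this is speculation, not something verified in this thread:

```sh
# Ask GCC 11+ to emit DWARF 4, which llvm-bolt from LLVM 14 can parse.
./configure --enable-optimizations --with-lto --enable-bolt CFLAGS="-gdwarf-4"
```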
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.
Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again". I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.
@gvanrossum @kmod cc @markshannon
Interesting result! The following benchmark was measured on AWS c5n.metal / gcc-10 (base commit: f235178beccf5eb5b47e770240f32d9ba24b26fd). I'd like to re-measure the benchmark on the Faster CPython project machine as well. I am also going to measure the L1 i-cache miss ratio soon, on a machine where the perf tool is available.
Benchmark | CPython 3.12 ./configure --enable-optimizations --with-lto | CPython 3.12 ./configure --enable-optimizations --with-lto --enable-bolt |
---|---|---|
2to3 | 269 ms | 255 ms: 1.05x faster |
chameleon | 7.39 ms | 7.02 ms: 1.05x faster |
chaos | 74.1 ms | 68.8 ms: 1.08x faster |
crypto_pyaes | 82.3 ms | 77.2 ms: 1.07x faster |
deltablue | 3.65 ms | 3.41 ms: 1.07x faster |
django_template | 38.6 ms | 35.3 ms: 1.09x faster |
dulwich_log | 67.6 ms | 58.7 ms: 1.15x faster |
fannkuch | 385 ms | 380 ms: 1.02x faster |
float | 73.2 ms | 72.4 ms: 1.01x faster |
genshi_text | 24.3 ms | 23.3 ms: 1.04x faster |
genshi_xml | 56.4 ms | 52.8 ms: 1.07x faster |
go | 140 ms | 136 ms: 1.03x faster |
hexiom | 6.40 ms | 6.25 ms: 1.02x faster |
html5lib | 65.0 ms | 60.7 ms: 1.07x faster |
json_dumps | 11.1 ms | 10.4 ms: 1.07x faster |
json_loads | 28.7 us | 26.3 us: 1.09x faster |
logging_format | 7.29 us | 6.69 us: 1.09x faster |
logging_silent | 101 ns | 97.6 ns: 1.03x faster |
logging_simple | 6.48 us | 6.01 us: 1.08x faster |
mako | 10.6 ms | 9.91 ms: 1.07x faster |
meteor_contest | 106 ms | 102 ms: 1.04x faster |
nbody | 86.4 ms | 87.7 ms: 1.02x slower |
nqueens | 91.3 ms | 88.1 ms: 1.04x faster |
pathlib | 19.0 ms | 16.8 ms: 1.13x faster |
pickle_dict | 32.2 us | 32.6 us: 1.01x slower |
pickle_list | 4.69 us | 4.62 us: 1.02x faster |
pickle_pure_python | 297 us | 282 us: 1.05x faster |
pidigits | 177 ms | 176 ms: 1.01x faster |
pyflate | 423 ms | 416 ms: 1.02x faster |
python_startup | 8.72 ms | 8.15 ms: 1.07x faster |
python_startup_no_site | 6.35 ms | 5.97 ms: 1.06x faster |
raytrace | 312 ms | 293 ms: 1.06x faster |
regex_compile | 139 ms | 131 ms: 1.06x faster |
regex_dna | 180 ms | 185 ms: 1.03x slower |
regex_effbot | 2.99 ms | 2.82 ms: 1.06x faster |
regex_v8 | 21.4 ms | 20.4 ms: 1.05x faster |
richards | 48.6 ms | 46.3 ms: 1.05x faster |
scimark_fft | 348 ms | 338 ms: 1.03x faster |
scimark_lu | 120 ms | 117 ms: 1.02x faster |
scimark_monte_carlo | 67.0 ms | 65.4 ms: 1.02x faster |
scimark_sor | 116 ms | 113 ms: 1.02x faster |
spectral_norm | 101 ms | 102 ms: 1.01x slower |
sqlalchemy_declarative | 143 ms | 135 ms: 1.06x faster |
sqlalchemy_imperative | 19.0 ms | 17.0 ms: 1.12x faster |
sqlite_synth | 2.50 us | 2.29 us: 1.09x faster |
sympy_expand | 507 ms | 465 ms: 1.09x faster |
sympy_integrate | 21.7 ms | 20.5 ms: 1.06x faster |
sympy_sum | 176 ms | 164 ms: 1.08x faster |
sympy_str | 311 ms | 286 ms: 1.09x faster |
telco | 7.02 ms | 6.36 ms: 1.10x faster |
tornado_http | 125 ms | 113 ms: 1.10x faster |
unpickle | 15.7 us | 15.1 us: 1.04x faster |
unpickle_list | 4.74 us | 4.56 us: 1.04x faster |
unpickle_pure_python | 229 us | 219 us: 1.05x faster |
xml_etree_parse | 158 ms | 155 ms: 1.02x faster |
xml_etree_iterparse | 103 ms | 101 ms: 1.02x faster |
xml_etree_generate | 91.0 ms | 84.3 ms: 1.08x faster |
xml_etree_process | 61.9 ms | 58.4 ms: 1.06x faster |
Geometric mean | (ref) | 1.05x faster |
Benchmark hidden because not significant (3): pickle, scimark_sparse_mat_mult, unpack_sequence
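For anyone reproducing this, a comparison table like the one above can be generated with pyperformance and pyperf (a sketch; the file names and binary paths are hypothetical):

```sh
# Run the suite once per build, then compare the two result files.
./python-lto -m pyperformance run -o lto.json
./python-lto-bolt -m pyperformance run -o lto-bolt.json
python3 -m pyperf compare_to lto.json lto-bolt.json --table
```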
Another benchmark from an Azure VM (Ubuntu 20.04.4 LTS, gcc 9.4.0): https://gist.github.com/corona10/c2aa0108a5ffcc96be449c0ce033412d
But let's measure the benchmark on the Faster CPython machine after the PR is merged.
I managed to get cache-miss-related metadata, and I also got a pyperformance result similar to my previous attempts and Kevin's report. I didn't analyze whether the GCC version or OS version could affect the performance result, but I can conclude that BOLT definitely makes CPython faster.
Environment
- Hardware: AWS c5n.metal
- OS: Red Hat Enterprise Linux release 8.6 (Ootpa)
- gcc: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10)
- LLVM: 14.0.6
Binary Size
- Without BOLT: 79M
- With BOLT: 36M
ICache miss
Experiment | Instructions | L1-icache-misses | Miss ratio |
---|---|---|---|
PGO + LTO | 8,330,863,079,932 | 77,047,357,163 | 0.92% |
PGO + LTO + BOLT | 8,312,698,165,975 | 65,319,225,064 | 0.79% |
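Counters like these can be collected with perf stat along the lines of the sketch below; exact event names vary by kernel and CPU (some systems report L1-icache-load-misses instead), so treat them as assumptions:

```sh
# Count retired instructions and L1 i-cache misses across a benchmark run.
perf stat -e instructions,L1-icache-misses -- ./python -m pyperformance run --fast
```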
Benchmark (1.01x faster)
https://gist.github.com/corona10/5726d1528176677d4c694265edfc4bf5
Thanks for taking a look! Yes, many of the Pyston macrobenchmarks broke in 3.11, but it looks like @mdboom is currently working on updating the dependencies to versions that are compatible with 3.11.
I have made the requested changes; please review again
Thanks for making the requested changes!
@corona10: please review the changes made to this pull request.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.
Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again". I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.
And if you don't make the requested changes, you will be put in the comfy chair!
Hi,
One thing to note is that a Python program may spend considerable time in the shared libraries corresponding to extension modules, and BOLT can optimize those as well. I would suggest profiling benchmarks with perf and optimizing the .so files too.
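A sketch of what that could look like for one extension module, using BOLT's sampling-based flow; the _example.so path and flag choices are hypothetical and not part of this PR:

```sh
# Sample branch data while exercising the module (requires LBR support).
perf record -e cycles:u -j any,u -- ./python -m test --pgo

# Convert the samples into BOLT's .fdata format for the target library.
perf2bolt -p perf.data -o example.fdata ./_example.so

# Rewrite the shared library using the collected profile.
llvm-bolt ./_example.so -o ./_example.bolt.so -data=example.fdata \
    -reorder-blocks=cache+ -reorder-functions=hfsort+
```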
@aaupov I would like to recommend creating an issue for your suggestion at https://github.com/faster-cpython/ideas or https://github.com/python/cpython/issues. I think the faster-cpython repo is more appropriate :)
Another question, and an important aspect of performance tuning: GCC PGO and Clang PGO are different, and GCC's profiler (-fprofile-generate) can collect deeper data for PGO than Clang's profile generation. So it would be nice to add new flags such as --enable-lto-gcc and --enable-pgo-gcc (taking into account, at the GCC level, the reorder flags that BOLT needs), plus one compile chain done completely in Clang: --enable-lto-llvm and --enable-pgo-llvm plus BOLT.
Thank you very much