gh-90536: Add support for the BOLT post-link binary optimizer
Using BOLT provides a fairly large speedup without any code or functionality changes: roughly a 1% speedup on pyperformance, and a 4% improvement on the Pyston web macrobenchmarks.
It is gated behind an --enable-bolt configure argument because not all toolchains and environments are supported. It has been tested on a Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6 sources (the binary distribution of that version did not include BOLT).
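For concreteness, enabling it looks roughly like this (a sketch; it assumes llvm-bolt and merge-fdata from an LLVM 14.x build are on $PATH):

```sh
# Configure with PGO, LTO, and the new BOLT step, then build.
./configure --enable-optimizations --with-lto --enable-bolt
make -j"$(nproc)"
```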
Compared to a previous attempt, this commit uses BOLT's preferred "instrumentation" approach, and adds some non-PIE flags which enable much better optimizations from BOLT.
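For context, the instrumentation approach is roughly a three-pass flow like the sketch below; this is a hand-written outline, not the literal Makefile rules, and the optimization flags shown are assumptions:

```sh
# 1. Insert profiling counters into the freshly linked interpreter.
llvm-bolt ./python -instrument -o ./python.instrumented

# 2. Run a training workload; the instrumented binary writes .fdata profiles
#    (to /tmp/prof.fdata by default).
./python.instrumented -m test --pgo
merge-fdata /tmp/prof.fdata* > prof.fdata

# 3. Rewrite the original binary using the collected profile.
llvm-bolt ./python -o ./python.bolt -data=prof.fdata \
    -reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3
```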
The effects of this change are a bit more dependent on CPU microarchitecture than other changes, since it optimizes i-cache behavior, which seems to vary more between architectures. The 1%/4% numbers were collected on an Intel Skylake CPU; on an AMD Zen 3 CPU I got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance a slightly lower one (1%/3%).
The low speedup on pyperformance is not entirely unexpected, because BOLT improves i-cache behavior, and the benchmarks in the pyperformance suite are small and tend to fit in i-cache.
This change uses the existing PGO profiling task (python -m test --pgo), though I was able to measure about a 1% macrobenchmark improvement by using the macrobenchmarks as the training task. I personally think that both the PGO and BOLT tasks should be updated to use macrobenchmarks, but for the sake of splitting up the work this PR uses the existing PGO task.
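If someone wants to experiment with a different training task, CPython's build already exposes the PGO task as the PROFILE_TASK make variable; whether the BOLT step picks it up is an assumption here, and the pyperformance invocation is only illustrative:

```sh
# Swap the default "-m test --pgo" training workload for macrobenchmarks.
make PROFILE_TASK='-m pyperformance run --fast' -j"$(nproc)"
```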
- Issue: gh-90536
Most changes to Python require a NEWS entry.
Please add it using the blurb_it web app or the blurb command-line tool.
Thanks! I hope @corona10 can review and merge this, and maybe @pablogsal will be willing to backport it to 3.11.
Unfortunately, changes in the configure script or makefile are too much at this stage, especially for a new feature that has not been tested in the wild (by users checking the pre-releases). Sadly, this must go to 3.12.
Nice work! I will take a look at this PR by this weekend
@kmod
I am verifying the patch on the c6i.xlarge EC2 instance. Could you provide the compiler version you used?
Hmm, I will try to build BOLT from LLVM 14.0.6
I found out why BOLT failed; I will downgrade the GCC version to 10. DWARF 5 has become the default in GCC 11.
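If the failure is indeed LLVM 14 BOLT's lack of DWARF 5 support, an untested alternative to downgrading GCC would be forcing DWARF 4 debug info; this is speculation, not something verified in this thread:

```sh
# Ask GCC 11+ to emit DWARF 4, which llvm-bolt from LLVM 14 can parse.
./configure --enable-optimizations --with-lto --enable-bolt CFLAGS="-gdwarf-4"
```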
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.
Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again". I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.
@gvanrossum @kmod cc @markshannon
Interesting result! The following benchmark was measured on AWS c5n.metal / gcc-10 (base commit: f235178beccf5eb5b47e770240f32d9ba24b26fd). I'd like to re-measure the benchmark on the Faster CPython project machine as well. I am also going to measure the L1 i-cache miss ratio soon, on a machine where the perf tool is available.
Benchmark | CPython 3.12 ./configure --enable-optimizations --with-lto | CPython 3.12 ./configure --enable-optimizations --with-lto --enable-bolt |
---|---|---|
2to3 | 269 ms | 255 ms: 1.05x faster |
chameleon | 7.39 ms | 7.02 ms: 1.05x faster |
chaos | 74.1 ms | 68.8 ms: 1.08x faster |
crypto_pyaes | 82.3 ms | 77.2 ms: 1.07x faster |
deltablue | 3.65 ms | 3.41 ms: 1.07x faster |
django_template | 38.6 ms | 35.3 ms: 1.09x faster |
dulwich_log | 67.6 ms | 58.7 ms: 1.15x faster |
fannkuch | 385 ms | 380 ms: 1.02x faster |
float | 73.2 ms | 72.4 ms: 1.01x faster |
genshi_text | 24.3 ms | 23.3 ms: 1.04x faster |
genshi_xml | 56.4 ms | 52.8 ms: 1.07x faster |
go | 140 ms | 136 ms: 1.03x faster |
hexiom | 6.40 ms | 6.25 ms: 1.02x faster |
html5lib | 65.0 ms | 60.7 ms: 1.07x faster |
json_dumps | 11.1 ms | 10.4 ms: 1.07x faster |
json_loads | 28.7 us | 26.3 us: 1.09x faster |
logging_format | 7.29 us | 6.69 us: 1.09x faster |
logging_silent | 101 ns | 97.6 ns: 1.03x faster |
logging_simple | 6.48 us | 6.01 us: 1.08x faster |
mako | 10.6 ms | 9.91 ms: 1.07x faster |
meteor_contest | 106 ms | 102 ms: 1.04x faster |
nbody | 86.4 ms | 87.7 ms: 1.02x slower |
nqueens | 91.3 ms | 88.1 ms: 1.04x faster |
pathlib | 19.0 ms | 16.8 ms: 1.13x faster |
pickle_dict | 32.2 us | 32.6 us: 1.01x slower |
pickle_list | 4.69 us | 4.62 us: 1.02x faster |
pickle_pure_python | 297 us | 282 us: 1.05x faster |
pidigits | 177 ms | 176 ms: 1.01x faster |
pyflate | 423 ms | 416 ms: 1.02x faster |
python_startup | 8.72 ms | 8.15 ms: 1.07x faster |
python_startup_no_site | 6.35 ms | 5.97 ms: 1.06x faster |
raytrace | 312 ms | 293 ms: 1.06x faster |
regex_compile | 139 ms | 131 ms: 1.06x faster |
regex_dna | 180 ms | 185 ms: 1.03x slower |
regex_effbot | 2.99 ms | 2.82 ms: 1.06x faster |
regex_v8 | 21.4 ms | 20.4 ms: 1.05x faster |
richards | 48.6 ms | 46.3 ms: 1.05x faster |
scimark_fft | 348 ms | 338 ms: 1.03x faster |
scimark_lu | 120 ms | 117 ms: 1.02x faster |
scimark_monte_carlo | 67.0 ms | 65.4 ms: 1.02x faster |
scimark_sor | 116 ms | 113 ms: 1.02x faster |
spectral_norm | 101 ms | 102 ms: 1.01x slower |
sqlalchemy_declarative | 143 ms | 135 ms: 1.06x faster |
sqlalchemy_imperative | 19.0 ms | 17.0 ms: 1.12x faster |
sqlite_synth | 2.50 us | 2.29 us: 1.09x faster |
sympy_expand | 507 ms | 465 ms: 1.09x faster |
sympy_integrate | 21.7 ms | 20.5 ms: 1.06x faster |
sympy_sum | 176 ms | 164 ms: 1.08x faster |
sympy_str | 311 ms | 286 ms: 1.09x faster |
telco | 7.02 ms | 6.36 ms: 1.10x faster |
tornado_http | 125 ms | 113 ms: 1.10x faster |
unpickle | 15.7 us | 15.1 us: 1.04x faster |
unpickle_list | 4.74 us | 4.56 us: 1.04x faster |
unpickle_pure_python | 229 us | 219 us: 1.05x faster |
xml_etree_parse | 158 ms | 155 ms: 1.02x faster |
xml_etree_iterparse | 103 ms | 101 ms: 1.02x faster |
xml_etree_generate | 91.0 ms | 84.3 ms: 1.08x faster |
xml_etree_process | 61.9 ms | 58.4 ms: 1.06x faster |
Geometric mean | (ref) | 1.05x faster |
Benchmark hidden because not significant (3): pickle, scimark_sparse_mat_mult, unpack_sequence
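For anyone reproducing this, a comparison table like the one above can be generated with pyperformance and pyperf (a sketch; the file names and binary paths are hypothetical):

```sh
# Run the suite once per build, then compare the two result files.
./python-lto -m pyperformance run -o lto.json
./python-lto-bolt -m pyperformance run -o lto-bolt.json
python3 -m pyperf compare_to lto.json lto-bolt.json --table
```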
Another benchmark from an Azure VM (Ubuntu 20.04.4 LTS, gcc 9.4.0): https://gist.github.com/corona10/c2aa0108a5ffcc96be449c0ce033412d
But let's measure the benchmark on the Faster CPython machine after the PR is merged.
I managed to get cache-miss-related metadata, and I also got a pyperformance result similar to my previous attempts and Kevin's report. I didn't analyze whether the GCC version or OS version could affect the performance result, but I can conclude that BOLT definitely makes CPython faster.
Environment
- Hardware: AWS c5n.metal
- OS: Red Hat Enterprise Linux release 8.6 (Ootpa)
- gcc: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10)
- LLVM: 14.0.6
Binary Size
- Without BOLT: 79M
- With BOLT: 36M
ICache miss
Experiment | Instructions | L1-icache-misses | Miss ratio |
---|---|---|---|
PGO + LTO | 8,330,863,079,932 | 77,047,357,163 | 0.92% |
PGO + LTO + BOLT | 8,312,698,165,975 | 65,319,225,064 | 0.79% |
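Counters like these can be collected with perf stat along the lines of the sketch below; exact event names vary by kernel and CPU (some systems report L1-icache-load-misses instead), so treat them as assumptions:

```sh
# Count retired instructions and L1 i-cache misses across a benchmark run.
perf stat -e instructions,L1-icache-misses -- ./python -m pyperformance run --fast
```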
Benchmark (1.01x faster)
https://gist.github.com/corona10/5726d1528176677d4c694265edfc4bf5
Thanks for taking a look! Yes, many of the Pyston macrobenchmarks broke in 3.11, but it looks like @mdboom is currently working on updating the dependencies to versions that are compatible with 3.11.
I have made the requested changes; please review again
Thanks for making the requested changes!
@corona10: please review the changes made to this pull request.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.
Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again". I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.
And if you don't make the requested changes, you will be put in the comfy chair!
Hi,
One thing to note is that a Python program may spend considerable time in the shared libraries corresponding to extension modules, and BOLT can optimize those as well. I would suggest profiling benchmarks with perf and optimizing the .so files too.
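A sketch of what that could look like for one extension module, using BOLT's sampling-based flow; the _example.so path and flag choices are hypothetical and not part of this PR:

```sh
# Sample branch data while exercising the module (requires LBR support).
perf record -e cycles:u -j any,u -- ./python -m test --pgo

# Convert the samples into BOLT's .fdata format for the target library.
perf2bolt -p perf.data -o example.fdata ./_example.so

# Rewrite the shared library using the collected profile.
llvm-bolt ./_example.so -o ./_example.bolt.so -data=example.fdata \
    -reorder-blocks=cache+ -reorder-functions=hfsort+
```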
@aaupov I would like to recommend creating an issue for your suggestion at https://github.com/faster-cpython/ideas or https://github.com/python/cpython/issues. I think the faster-cpython repo is more appropriate :)
Another question, and an important aspect of performance tuning: GCC PGO and Clang PGO are different, and GCC's profiler (-fprofile-generate) can collect deeper data for PGO than Clang's profile generation. So it would be nice to add new flags such as --enable-lto-gcc and --enable-pgo-gcc (taking into account, at the GCC level, the reorder flags that BOLT needs), plus one compile chain done completely in Clang: --enable-lto-llvm and --enable-pgo-llvm plus BOLT.
Thank you very much