Statically link the `python` executable to `libpython` and disable the shared library
As part of investigating #535, we posited that Conda's static linking of the python executable was part of the performance difference.
This change gives a 10% performance improvement (geometric mean on pyperformance).
Does this affect the actual run-time performance of the Python interpreter? Or just the time to start a new process and init the interpreter?
I.e. what is the benchmark actually measuring?
It seems to have a significant holistic effect — this matches some expectations set by the conda-forge folks and @carljm.
The referenced number is on the full pyperformance benchmark suite.
It also seemed to drastically improve performance on the benchmark in #535. I can tweak the number of calculations such that it's not dominated by interpreter startup, i.e., a runtime per Python process of >5s.
I definitely intend to do more benchmarking before marking this as ready for review, I'll post full results then.
Note to self: should consider updating the following:
https://github.com/astral-sh/python-build-standalone/blob/4615f2f41cfb033f1838f1ca5f7e1f41d53e7222/cpython-unix/build.py#L840
https://github.com/astral-sh/python-build-standalone/blob/4615f2f41cfb033f1838f1ca5f7e1f41d53e7222/cpython-unix/build.py#L855
https://github.com/astral-sh/python-build-standalone/blob/4615f2f41cfb033f1838f1ca5f7e1f41d53e7222/cpython-unix/build.py#L510
https://github.com/astral-sh/python-build-standalone/blob/4615f2f41cfb033f1838f1ca5f7e1f41d53e7222/cpython-unix/build.py#L531
https://github.com/astral-sh/python-build-standalone/blob/4615f2f41cfb033f1838f1ca5f7e1f41d53e7222/cpython-unix/build.py#L877
Though I think we may want to retain the shared library even if python doesn't link to it?
I think it is important to understand why static linking is faster. It could be many different things. Some of them might be fixable on shared libraries.
As a first step I would compare binaries without PGO, LTO, and BOLT.
Static linking unlocks all kinds of optimizations. I suspect what we're seeing is the result of aggressive inlining or something of that nature.
Also, I strongly prefer we still ship a libpython.so, even if Python doesn't link it. This gets you the performance without losing the shared library, which some customers will want.
Agree on all those points.
As a note, @geofft has been investigating some other problems that statically linking would solve. I expect he'll engage on exploring this further.
Are you referring to symbol resolution issues with binary packages pulling in 3rd party libraries [that can overlap with the libraries we statically link]?
I think it's mostly related to downstream consumption requiring rpath hacks, like:
- https://github.com/PyO3/pyo3/pull/4890
- https://github.com/Nuitka/Nuitka/pull/3333
There is a very old (2002) Debian bug reporting that statically linking libpython is good for performance: https://bugs.debian.org/131813
There too it's about steady-state runtime performance, not startup cost. I think the idea is that there is less back-and-forth between the executable and the library, but it's a good question why this is actually true, given that most of the hot code paths should be fully within the library.
Debian does something unusual in that they ship a libpython.so too, and the way they do it is that they build twice, once with `--enable-shared=no` and once with `--enable-shared=yes`; the resulting package has the python3.x binary and libpython.a from the former and libpython.so from the latter. The linker seems to find libpython.so first if you specify `-lpython3`. I am not yet confident that this is guaranteed.
If we were to ship a libpython.a then, yes, downstream consumers would have an easier time of things because no rpath is required. (Notably, cargo test in a pyo3 project would work out of the box; right now you need to write a build.rs file to set the rpath for the test runner binary.) We would have to ensure the linker finds libpython.a first, or stop shipping a libpython.so.
Note that, as implied by what Debian does, whether we ship a libpython.a and/or a libpython.so is not necessarily correlated with which one our bin/python3 uses. So, we could (at the cost of a longer build time) ship a bin/python3 that statically links libpython but also continue to ship a shared library for people who want it.
(I suppose it's also possible that this doesn't actually require two builds, and with sufficient changes to the CPython build system, you can get it to produce both a libpython.a and a libpython.so in the same build.)
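The two-pass approach described above could be sketched roughly like this (prefixes, paths, and the way the two trees get merged are illustrative, not what Debian's packaging actually does):

```shell
# Pass 1: static-only build; produces bin/python3.x and libpython3.x.a
./configure --prefix=/opt/cpython --enable-shared=no --enable-optimizations
make -j"$(nproc)"
make install

# Pass 2: shared build; we only keep the shared library from it
make distclean
./configure --prefix=/tmp/cpython-shared --enable-shared=yes --enable-optimizations
make -j"$(nproc)"
make install

# Assemble the distribution: static binary + libpython.a from pass 1,
# libpython.so from pass 2
cp /tmp/cpython-shared/lib/libpython3.*.so* /opt/cpython/lib/
```

Whether the linker then prefers the .a or the .so when both are present is exactly the ordering question raised above.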
Fun fact, for the third-party libraries, a handful of downstream consumers would have an easier time if we moved from a static e.g. Tcl/Tk to a shared one. (Notably, PyInstaller outputs a C binary whose splash screen uses Tcl/Tk, so they need the ability to get to those libraries from C, before they've unpacked the Python distribution.)
> There is a very old (2002) Debian bug reporting that statically linking libpython is good for performance: https://bugs.debian.org/131813

I wanted to note that statically linking libpython has yielded proven performance gains for Nuitka as well.
That makes more intuitive sense to me in that Nuitka compiles what it can, so you're going back and forth between the main program and libpython for the stuff that didn't get compiled. But `bin/python` is literally just `int main(int argc, char **argv) { return Py_BytesMain(argc, argv); }`, so there should be no back and forth. So the fact that there's a difference there too is a little weird, at least to my intuition!
(One mildly weird idea, btw, is that it's possible for a shared library to have an entry point: try running `/lib/x86_64-linux-gnu/libc.so.6` directly, for instance. So you could imagine a distribution where `bin/python` is a symlink to `../lib/libpython.so`, as opposed to an actual executable that depends on it, which would act sort of like a `bin/python` built against static libpython but be usable as a libpython.so too... but that might not help things if the actual performance problem is behavior differences from `Py_ENABLE_SHARED` being defined, as opposed to merely being a loaded library.)
> Fun fact, for the third-party libraries, a handful of downstream consumers would have an easier time if we moved from a static e.g. Tcl/Tk to a shared one. (Notably, PyInstaller outputs a C binary whose splash screen uses Tcl/Tk, so they need the ability to get to those libraries from C, before they've unpacked the Python distribution.)
Yeah, this is what I was getting at. Having them as separate libraries helps with symbol resolution issues. It was always on my undocumented backlog to split out at least tcl/tk and the x11 libraries into standalone shared libraries to mitigate this issue.
On the static vs dynamic bit, `python` is literally just a single function call into libpython, so execution shouldn't be bouncing around between those two ELF binaries. The speedup can't be explained by that.
I think the speedup is coming from the compiler/linker no longer having to provide strong ABI guarantees around functions. I think statically linking libpython is enabling it to more aggressively optimize functions without regard to function boundaries.
It might be doing some funky copying of functions, because I thought you still needed to export the libpython symbols so loaded extension modules can continue using them. You'd really need to do some low-level debugging (maybe disassembling) to get to the bottom of things. I'd feed the statically linked binary into Ghidra and look at the core interpreter loop to see if any funky inlining of libpython symbols is going on.
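Short of Ghidra, a low-tech first pass at that investigation is to compare the dynamic symbol tables and call sites of the two builds (the binary names below are hypothetical placeholders for the static and shared builds):

```shell
# Are libpython symbols still in the dynamic symbol table of the static
# binary? (They need to be, if extension modules resolve against it.)
nm -D ./python-static | grep -c ' T Py'

# In the shared build, the binary itself exports almost nothing and the
# symbols live in libpython instead.
nm -D ./python-shared | grep -c ' T Py'

# Spot-check the interpreter loop: in the static binary, calls into formerly
# exported functions should be direct (no @plt suffix), and small helpers
# may have been inlined away entirely.
objdump -d ./python-static | sed -n '/<_PyEval_EvalFrameDefault>:/,/^$/p' | grep call
objdump -d ./python-shared | sed -n '/<_PyEval_EvalFrameDefault>:/,/^$/p' | grep call
```

A large drop in `@plt` calls in the static binary would support the "no interposition / cross-boundary optimization" theory.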
Just as an additional FYI, it seems that some corner-case downstream uses of Python do not work as expected with a statically linked Python; see for example (just a few I encountered in the past):
- https://github.com/conda-forge/python-feedstock/issues/595
- https://github.com/PixarAnimationStudios/OpenUSD/issues/2371
- https://github.com/JuliaPy/PythonCall.jl/issues/464#issuecomment-1985865451
I also recall a lot of macOS segfaults in CMake projects creating extensions as `SHARED` instead of `MODULE` libraries, but I can't find an issue at the moment. Probably nothing blocking, but something that could make sense to consider.
> I also recall a lot of macOS segfaults in CMake projects creating extensions as `SHARED` instead of `MODULE` libraries, but I can't find an issue at the moment.

Found: https://github.com/pybind/pybind11/issues/3907
Yeah, these linked issues seemingly confirm what I thought: extension module builds really want to run against the Python they were built against. If there is a mismatch between the build and runtime Python, things can blow up.
In Conda's world, they have their own universe of binary dependencies. But in PBS / uv world, there isn't as much a buffer here. So my fear is that if PBS ships a static libpython, we're signing ourselves up for all kinds of random extension module breakage.
We could assess risk by downloading popular PyPI packages and verifying extension modules load and run. But the "run" part is difficult since there's no guaranteed way to run tests from a wheel. And even if PyPI is fine, you are going to be finding people building extensions behind corporate walls encountering issues.
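A rough version of that smoke test could look like the following (the package list is illustrative, and it only exercises import, not the packages' test suites; `./python` stands for the statically linked interpreter under test):

```shell
# Smoke-test that popular binary wheels load under the candidate build.
for pkg in numpy markupsafe cffi; do
  ./python -m pip install --quiet --only-binary=:all: "$pkg"
  if ./python -c "import $pkg"; then
    echo "OK: $pkg"
  else
    echo "FAILED: $pkg"
  fi
done
```

As noted above, this only covers wheels on PyPI; extensions built behind corporate walls against a different libpython arrangement would remain untested.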
I want to support this work. But I'm worried about side-effects.
Thanks for sharing those @traversaro! That's helpful context.
Just for some context on how I'm thinking about this pull request: I posted this for discussion and testing — I'm not in any rush to land this.
> Yeah, these linked issues seemingly confirm what I thought: extension module builds really want to run against the Python they were built against. If there is a mismatch between the build and runtime Python, things can blow up.
Just to clarify, all those issues were related to conda installations, so that was not the problem: compatible versions of python and libpython were used. I am not saying that mismatching build and runtime Pythons can't be a problem, just that the issues I linked are related to other problems. I guess what connects all the linked issues is how the macOS linking model deals with duplicate symbols (even if the symbols are identical) in the Python executable and in a libpython linked into a Python extension that is opened via dlopen.
Statically linking libpython can give a significant speedup, but this configuration is not universal: I think Fedora/RHEL link dynamically whereas Debian/Ubuntu link statically.
When libpython is dynamically linked, there used to be a significant performance gain from disabling semantic interposition (via `-fno-semantic-interposition`). Fedora found performance improvements of up to ~27% with this flag set; note that Fedora found similar gains from statically linking.
This flag has been included by default since Python 3.10 when `--enable-optimizations` is specified and GCC is used. I think disabling semantic interposition for symbols within the same library is the default for Clang, so the flag is not needed there.
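The effect is easy to reproduce on a toy shared library (this assumes GCC on x86-64 Linux; `foo`/`bar` are made-up names). With the default `-fsemantic-interposition`, an internal call to an exported function has to go through the PLT so it can be interposed at load time, which also blocks inlining; with `-fno-semantic-interposition` the compiler may bind the call locally and inline it:

```shell
cat > demo.c <<'EOF'
int bar(void) { return 41; }
/* An internal call to an exported symbol: interposable by default. */
int foo(void) { return bar() + 1; }
EOF

# Default: the call to bar() goes through the PLT.
gcc -O2 -fPIC -shared demo.c -o libdemo_default.so
objdump -d libdemo_default.so | grep -q 'bar@plt' && echo "default: call via PLT"

# With the flag: bar() is bound locally (and here inlined), so no PLT call.
gcc -O2 -fPIC -fno-semantic-interposition -shared demo.c -o libdemo_nosi.so
objdump -d libdemo_nosi.so | grep -q 'bar@plt' || echo "nosi: bound locally"
```

Statically linking gets you the same local binding for free, which is one plausible mechanism behind the gains observed here.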
c.f. https://github.com/conda-forge/python-feedstock/issues/287
We do use `--enable-optimizations` and Clang (for most builds), so the `-fno-semantic-interposition` case should be accounted for.
Superseded by #592