
Building with LTO

Open longnguyen2004 opened this issue 4 years ago • 6 comments

This would allow the compiler to further optimize LLVM and increase the speed of the resulting toolchain (maybe even reduce the size?). Is this possible?

longnguyen2004 avatar Mar 21 '21 02:03 longnguyen2004

Yes, but it comes at a noticeable cost.

First off, if you're e.g. on Linux and building just a cross compiler, the compiler itself will be built by your distribution's default compiler; in the case of Ubuntu 18.04, this is GCC 7. With the newly built clang, one could then recompile LLVM, letting the new clang optimize the toolchain being built. This is called a 2-stage build, and it roughly doubles the compilation time if building just a cross compiler. If building a toolchain that is going to run on Windows, this already happens (as one first builds the cross compiler, and then uses it to build a toolchain for Windows).
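For reference, a manual 2-stage build (done by hand, outside this repo's scripts) could look roughly like this; this is a sketch only, with assumed build directory names and a minimal set of CMake options:

```shell
# Stage 1: build clang/lld with the system compiler (e.g. GCC 7)
cmake -G Ninja -S llvm -B build-stage1 \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="clang;lld"
ninja -C build-stage1

# Stage 2: rebuild with the stage-1 clang, letting clang optimize clang
cmake -G Ninja -S llvm -B build-stage2 \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DCMAKE_C_COMPILER=$PWD/build-stage1/bin/clang \
    -DCMAKE_CXX_COMPILER=$PWD/build-stage1/bin/clang++
ninja -C build-stage2
```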

On Ubuntu 18.04, a Clang 12.0 compiled with itself is around 12% faster than when it's built by the system GCC 7.

Let's first rehash a few basics: when building with LTO (with Clang), you're shifting work from the compile commands to the link command. (Each compile command stops after emitting LLVM IR, deferring code generation and most optimization to link time.)
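Concretely, that split looks like this (a minimal sketch; `main.c` stands in for any translation unit):

```shell
# With -flto, this emits LLVM bitcode into main.o instead of machine code
clang -flto -O2 -c main.c -o main.o

# Code generation and cross-module optimization happen here, at link time
clang -flto -O2 main.o -o main
```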

LLVM LTO has two modes, full LTO and thin LTO. In full LTO, all the IR from all the input object files is compiled as one big module during linking, and all of that runs serially in one thread. If you build one binary this way, the total amount of CPU time used is roughly the same as otherwise, but the loss of parallelization makes it take way longer in wallclock time.

With thin LTO, the link time code generation and optimization is split into a number of parallel modules, so it can use threading. So if you build just one binary with Thin LTO on a multicore system, it takes roughly as long as it would without LTO. Thin LTO also has some sort of cache for parts of the intermediate products, to speed up incremental linking.
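As a back-of-the-envelope illustration of why this matters (made-up numbers, not measurements): the link-time code generation costs roughly the same CPU time either way, but full LTO runs it on one thread while thin LTO spreads it across cores:

```python
def wallclock_minutes(codegen_cpu_min: float, cores: int, parallel: bool) -> float:
    """Estimated wallclock time for link-time code generation.

    Full LTO (parallel=False) runs in one thread; thin LTO
    (parallel=True) ideally divides the work across all cores.
    """
    return codegen_cpu_min / cores if parallel else codegen_cpu_min

CODEGEN_CPU_MIN = 300.0  # hypothetical total CPU time spent on LTO codegen
CORES = 32

print(f"full LTO: {wallclock_minutes(CODEGEN_CPU_MIN, CORES, False):.1f} min wallclock at link time")
print(f"thin LTO: {wallclock_minutes(CODEGEN_CPU_MIN, CORES, True):.1f} min wallclock at link time")
```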

However, that cache doesn't seem to have a whole lot of effect when you're building lots of tools that all use some subset of the LLVM libraries, as each of them will do link-time code generation of parts of the build that don't match what the cache has from linking other tools.

So when building LLVM, which consists of a number of separate tools, enabling LTO is kinda costly as it runs LTO code generation individually for each of the N tools that are linked while building LLVM. (As thin LTO linking is threaded itself, the LLVM CMake build system forces linking to only be done with 2 linker jobs in parallel, if using thin LTO.)
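For the curious, the CMake side of this looks like the following; `LLVM_ENABLE_LTO` and `LLVM_PARALLEL_LINK_JOBS` are existing LLVM CMake options, and the job count of 2 mirrors what the build system picks for thin LTO:

```shell
cmake -G Ninja -S llvm -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_LTO=Thin \
    -DLLVM_PARALLEL_LINK_JOBS=2
```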

A Clang 12.0 built with itself with LTO is around 4-6% faster than one built without it. It doesn't seem to end up smaller though, on the contrary it's a bit bigger.

However, when it comes to time to build it, it sure is kinda costly. On a system with a lot of cores:

| Built with | Wallclock | CPU time | Binary size | Runtime speedup over the previous |
|---|---|---|---|---|
| GCC 7 | 4 min 36 sec | 342 min 42 sec | 100 MB | - |
| Clang 12, No LTO | 4 min 6 sec | 308 min 30 sec | 82 MB | 12% |
| Clang 12, Thin LTO | 15 min 19 sec | 788 min 2 sec | 98 MB | 4-6% |
| Clang 12, Full LTO | 26 min 10 sec | 622 min 9 sec | 98 MB | 1% |

So while it does make things faster, the cost for building it is quite high.

I pushed a commit that lets you enable this yourself if you want to, 4012843f3e4dfb08168b545660f29e05c68656cc, where you can pass --stage2, --thinlto and --lto to build-llvm.sh. One way of doing it would e.g. be to build a full cross compiler using ./build-all.sh <destination>, followed by a ./build-llvm.sh <destination> --stage2 --thinlto, to rebuild clang using the clang in that directory, with thin LTO. If cross-building a toolchain to run on Windows, you should be able to just add --thinlto to the build-llvm.sh call that builds the LLVM tools for Windows.
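Spelled out as commands (with `<destination>` being whatever prefix you're installing to):

```shell
# Build the full cross toolchain first, with the system compiler
./build-all.sh <destination>

# Then rebuild clang with the clang from that directory, with thin LTO
./build-llvm.sh <destination> --stage2 --thinlto
```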

mstorsjo avatar Mar 22 '21 12:03 mstorsjo

Now that we're using `LLVM_LINK_LLVM_DYLIB`, wouldn't the build-time cost of LTO be reduced quite a bit?

Keithcat1 avatar Aug 17 '21 18:08 Keithcat1

> Now that we're using `LLVM_LINK_LLVM_DYLIB`, wouldn't the build-time cost of LTO be reduced quite a bit?

That's quite possible, yes - in theory I guess it should end up with close to no duplicated work due to LTO, compared to when using static libraries. The actual benefit from LTO on the final binary might be a bit smaller than in a fully static case though.

mstorsjo avatar Aug 17 '21 18:08 mstorsjo

I imagine that, for example, all of the LLVM optimizer code is in libLLVM*.dll. Clang probably spends most of its time in there, and since that code all lives in the same library, it should be optimized as well as in a static build, unless Clang calls across the library boundary a lot.


Keithcat1 avatar Aug 17 '21 18:08 Keithcat1

Out of curiosity, I tried to recreate the benchmark above with the currently pinned version (13.0.0 RC1), and it pretty much confirms the observations above:

  • ThinLTO is much less expensive to build with LLVM_LINK_LLVM_DYLIB enabled
  • The actual speedup from building Clang with LTO is still fairly small (smaller in this test than before)
| Built with | Wallclock | CPU time | Binary size | Runtime speedup over the previous |
|---|---|---|---|---|
| GCC 7 | 4 min 38 sec | 351 min 14 sec | 125 MB | - |
| Clang 13, No LTO | 4 min 18 sec | 344 min 6 sec | 113 MB | 7% |
| Clang 13, Thin LTO | 6 min 8 sec | 383 min 45 sec | 126 MB | 1% |
| Clang 13, Full LTO | 32 min 29 sec | 328 min 37 sec | 126 MB | 3% |

I also tried to compare to the configuration without dylibs, and with Clang 12. The following numbers are total compilation time for a test subject (QtBase 6.1):

| Built with -> | GCC | Clang | Clang ThinLTO |
|---|---|---|---|
| Clang 13 Dylib | 2634 sec | 2465 sec | 2443 sec |
| Clang 13 | 2669 sec | 2461 sec | 2339 sec |
| Clang 12 | 2767 sec | 2452 sec | 2358 sec |

So overall in both Clang 12 and 13, ThinLTO does make things faster but not by a whole lot. The difference between being compiled with GCC and Clang is much smaller in Clang 13 than 12.

mstorsjo avatar Aug 18 '21 13:08 mstorsjo

By the way, Rust ships its standard libraries with .llvmbc sections in the object files that contain LLVM bitcode, allowing link-time optimization between the standard libraries and your code. Can we do something similar here? It might make LTO on Clang more worthwhile.

Keithcat1 avatar Oct 26 '21 17:10 Keithcat1