
Building with LTO

Open longnguyen2004 opened this issue 4 years ago • 6 comments

This would allow the compiler to further optimize LLVM and increase the speed of the resulting toolchain (maybe even reduce the size?). Is this possible?

longnguyen2004 avatar Mar 21 '21 02:03 longnguyen2004

Yes, but it comes at a noticeable cost.

First off, if you're e.g. on Linux and building just a cross compiler, the compiler itself will be built by your distribution's default compiler; in the case of Ubuntu 18.04, this is GCC 7. With the newly built clang, one could then recompile LLVM, letting the new clang optimize the toolchain being built. This is called a 2-stage build, and it roughly doubles the compilation time if building just a cross compiler. If building a toolchain that is going to run on Windows, this already happens (as one first builds the cross compiler, and then uses it to build a toolchain for Windows).
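For reference, a manual 2-stage build (done by hand, outside this repo's scripts) could look roughly like this; this is a sketch only, with assumed build directory names and a minimal set of CMake options:

```shell
# Stage 1: build clang/lld with the system compiler (e.g. GCC 7)
cmake -G Ninja -S llvm -B build-stage1 \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="clang;lld"
ninja -C build-stage1

# Stage 2: rebuild with the stage-1 clang, letting clang optimize clang
cmake -G Ninja -S llvm -B build-stage2 \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DCMAKE_C_COMPILER=$PWD/build-stage1/bin/clang \
    -DCMAKE_CXX_COMPILER=$PWD/build-stage1/bin/clang++
ninja -C build-stage2
```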

On Ubuntu 18.04, a Clang 12.0 compiled with itself is around 12% faster than when it's built by the system GCC 7.

Let's first rehash a few basics: when building with LTO (with Clang), you're shifting work from the compile commands to the link command. (Each compile command stops after emitting LLVM IR, deferring code generation and most optimization to link time.)
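Concretely, that split looks like this (a minimal sketch; `main.c` stands in for any translation unit):

```shell
# With -flto, this emits LLVM bitcode into main.o instead of machine code
clang -flto -O2 -c main.c -o main.o

# Code generation and cross-module optimization happen here, at link time
clang -flto -O2 main.o -o main
```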

LLVM LTO has two modes, full LTO and thin LTO. In full LTO, all the IR from all the input object files is compiled as one big module during linking, and all of that runs serially in one thread. If you build one binary this way, the total amount of CPU time used is roughly the same as otherwise, but the loss of parallelization makes it take way longer in wallclock time.

With thin LTO, the link time code generation and optimization is split into a number of parallel modules, so it can use threading. So if you build just one binary with Thin LTO on a multicore system, it takes roughly as long as it would without LTO. Thin LTO also has some sort of cache for parts of the intermediate products, to speed up incremental linking.
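As a back-of-the-envelope illustration of why this matters (made-up numbers, not measurements): the link-time code generation costs roughly the same CPU time either way, but full LTO runs it on one thread while thin LTO spreads it across cores:

```python
def wallclock_minutes(codegen_cpu_min: float, cores: int, parallel: bool) -> float:
    """Estimated wallclock time for link-time code generation.

    Full LTO (parallel=False) runs in one thread; thin LTO
    (parallel=True) ideally divides the work across all cores.
    """
    return codegen_cpu_min / cores if parallel else codegen_cpu_min

CODEGEN_CPU_MIN = 300.0  # hypothetical total CPU time spent on LTO codegen
CORES = 32

print(f"full LTO: {wallclock_minutes(CODEGEN_CPU_MIN, CORES, False):.1f} min wallclock at link time")
print(f"thin LTO: {wallclock_minutes(CODEGEN_CPU_MIN, CORES, True):.1f} min wallclock at link time")
```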

However, that cache doesn't seem to have a whole lot of effect when you're building lots of tools that all use some subset of the LLVM libraries, as each of them will do link-time code generation of parts of the build that don't match what the cache has from linking other tools.

So when building LLVM, which consists of a number of separate tools, enabling LTO is kinda costly as it runs LTO code generation individually for each of the N tools that are linked while building LLVM. (As thin LTO linking is threaded itself, the LLVM CMake build system forces linking to only be done with 2 linker jobs in parallel, if using thin LTO.)
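For the curious, the CMake side of this looks like the following; `LLVM_ENABLE_LTO` and `LLVM_PARALLEL_LINK_JOBS` are existing LLVM CMake options, and the job count of 2 mirrors what the build system picks for thin LTO:

```shell
cmake -G Ninja -S llvm -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_LTO=Thin \
    -DLLVM_PARALLEL_LINK_JOBS=2
```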

A Clang 12.0 built with itself with LTO is around 4-6% faster than one built without it. It doesn't seem to end up smaller though, on the contrary it's a bit bigger.

However, when it comes to time to build it, it sure is kinda costly. On a system with a lot of cores:

| Built with | Wallclock | CPU time | Binary size | Runtime speedup over the previous |
|---|---|---|---|---|
| GCC 7 | 4 min 36 sec | 342 min 42 sec | 100 MB | - |
| Clang 12, No LTO | 4 min 6 sec | 308 min 30 sec | 82 MB | 12% |
| Clang 12, Thin LTO | 15 min 19 sec | 788 min 2 sec | 98 MB | 4-6% |
| Clang 12, Full LTO | 26 min 10 sec | 622 min 9 sec | 98 MB | 1% |

So while it does make things faster, the cost for building it is quite high.

I pushed a commit that lets you enable this yourself if you want to, 4012843f3e4dfb08168b545660f29e05c68656cc, where you can pass --stage2, --thinlto and --lto to build-llvm.sh. One way of doing it would e.g. be to build a full cross compiler using ./build-all.sh <destination>, followed by a ./build-llvm.sh <destination> --stage2 --thinlto, to rebuild clang using the clang in that directory, with thin LTO. If cross-building a toolchain to run on Windows, you should be able to just add --thinlto to the build-llvm.sh call that builds the LLVM tools for Windows.
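Spelled out as commands (with `<destination>` being whatever prefix you're installing to):

```shell
# Build the full cross toolchain first, with the system compiler
./build-all.sh <destination>

# Then rebuild clang with the clang from that directory, with thin LTO
./build-llvm.sh <destination> --stage2 --thinlto
```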

mstorsjo avatar Mar 22 '21 12:03 mstorsjo

Now that we're using `LLVM_LINK_LLVM_DYLIB`, wouldn't the build-time cost of LTO be reduced quite a bit?

Keithcat1 avatar Aug 17 '21 18:08 Keithcat1

> Now that we're using `LLVM_LINK_LLVM_DYLIB`, wouldn't the build-time cost of LTO be reduced quite a bit?

That's quite possible, yes - in theory I guess it should end up with close to no duplicated work due to LTO, compared to when using static libraries. The actual benefit from LTO on the final binary might be a bit smaller than in a fully static case though.

mstorsjo avatar Aug 17 '21 18:08 mstorsjo

I imagine that, for example, all of the LLVM optimizer code is in libLLVM*.dll. Clang probably spends most of its time in there, and since that code all lives in the same library, it should be optimized as well as in a static build, unless Clang calls across the library boundary a lot.


Keithcat1 avatar Aug 17 '21 18:08 Keithcat1

Out of curiosity, I tried to recreate the benchmark above with the currently pinned version (13.0.0 RC1), and it pretty much confirms the observations above:

  • ThinLTO is much less expensive to build with LLVM_LINK_LLVM_DYLIB enabled
  • The actual speedup from building Clang with LTO is still fairly small (smaller in this test than before)
| Built with | Wallclock | CPU time | Binary size | Runtime speedup over the previous |
|---|---|---|---|---|
| GCC 7 | 4 min 38 sec | 351 min 14 sec | 125 MB | - |
| Clang 13, No LTO | 4 min 18 sec | 344 min 6 sec | 113 MB | 7% |
| Clang 13, Thin LTO | 6 min 8 sec | 383 min 45 sec | 126 MB | 1% |
| Clang 13, Full LTO | 32 min 29 sec | 328 min 37 sec | 126 MB | 3% |

I also tried to compare to the configuration without dylibs, and with Clang 12. The following numbers are total compilation time for a test subject (QtBase 6.1):

| Built with -> | GCC | Clang | Clang ThinLTO |
|---|---|---|---|
| Clang 13 Dylib | 2634 sec | 2465 sec | 2443 sec |
| Clang 13 | 2669 sec | 2461 sec | 2339 sec |
| Clang 12 | 2767 sec | 2452 sec | 2358 sec |

So overall in both Clang 12 and 13, ThinLTO does make things faster but not by a whole lot. The difference between being compiled with GCC and Clang is much smaller in Clang 13 than 12.

mstorsjo avatar Aug 18 '21 13:08 mstorsjo

By the way, Rust ships its standard libraries with .llvmbc sections in the object files that contain LLVM bitcode, allowing link-time optimization between the standard libraries and your code. Can we do something similar here? It might make LTO on Clang more worthwhile.

Keithcat1 avatar Oct 26 '21 17:10 Keithcat1