Add Profile-Guided Optimization (PGO) support
Hi!
I'd like to discuss adding explicit support to Buck2 for building applications with Profile-Guided Optimization (PGO). According to many articles and benchmarks, PGO can play a major role in optimizing an application. However, applying PGO manually by tweaking compiler flags is error-prone. That's why I think adding such a feature to Buck2 would be helpful.
Some examples of the current situation in other build systems:
- Cargo: No built-in support, but there is the awesome cargo-pgo
- Bazel: Supports (command-line reference)
- Meson: Supports (b_pgo in the docs)
- CMake: No support yet (GitLab issue)
- SCons: No support yet (GitHub discussion)
Since Facebook is the main LLVM BOLT contributor, more native LLVM BOLT integration into the build pipeline could also be an interesting idea for the Facebook-powered build system.
I believe the existing prelude does actually have some support for BOLT already, so presumably it could work with e.g. cxx_binary. I don't know how to enable it, though; someone from the toolchains team at Meta would probably have to chime in here with an example. It does look like you at least need bolt_enabled = true attached to your toolchains//:cxx target.
Note that most "instrumentation based profilers" like -fprofile-generate don't fit very well with hermetic, distributed build systems like Buck or Bazel. It's hard to see what Bazel does in the link you provided, but in general this approach requires backpropagating binary profiles into the build as source inputs and completely recompiling everything, and that has a lot of consequences (build time, artifact sizes, memory use, remote builds and caching, etc). In contrast, BOLT works on final outputs, so it's more general in some sense and doesn't require as much shoe-horning to recompile everything. But it does require reassembly, which can be expensive.
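To make the "instrument, run, recompile" cycle concrete, here is a minimal sketch of the three command lines involved when doing this by hand with Clang. The helper names and file paths are made up for illustration; only the -fprofile-generate and -fprofile-use flags and the llvm-profdata merge step are standard Clang/LLVM. This is not what Buck2 or Bazel actually runs.

```python
from typing import List


def instrumented_build_cmd(sources: List[str], output: str, profile_dir: str) -> List[str]:
    """Step 1: build a binary that writes raw profiles into profile_dir when run."""
    return ["clang", "-O2", f"-fprofile-generate={profile_dir}", *sources, "-o", output]


def merge_profiles_cmd(raw_profiles: List[str], merged: str) -> List[str]:
    """Step 2: after running the instrumented binary on a representative
    workload, merge the raw .profraw files into one indexed profile."""
    return ["llvm-profdata", "merge", f"-output={merged}", *raw_profiles]


def optimized_build_cmd(sources: List[str], output: str, merged: str) -> List[str]:
    """Step 3: recompile all sources from scratch, now with the merged profile as
    an extra input. This full rebuild is the part that is awkward for hermetic,
    cached, distributed builds."""
    return ["clang", "-O2", f"-fprofile-use={merged}", *sources, "-o", output]
```

Each helper only returns an argument list, so the result can be handed to subprocess.run(cmd, check=True) or just printed. Note that the merged profile becomes an input to every compile action in step 3, which is where the caching and remote-execution cost comes from.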
In that same vein, it would be great to see if something like Google's Propeller, which is in upstream LLVM, could also be added into the prelude. Propeller is a post-link optimizer like BOLT, but it "relinks" existing binaries rather than "disassembles-then-reassembles" them. It actually operates at a very high level, much more like ThinLTO than BOLT, and unlike BOLT it requires a second recompilation pass for hot code modules. Keep in mind, though, that BOLT would presumably have to pass over that same hot code and optimize it anyway, so the main difference comes down to being able to recompile hot modules in parallel with remote builds, which hopefully scales better. For very large binaries, that's sometimes a significant scalability difference (the paper has some numbers). Perhaps both BOLT and Propeller could even complement each other in the end...
I confirm that internally we do use BOLT - and bolt_enabled = true would be step 1. Unfortunately I've no real idea what happens after that. Maybe try it and report back?
Is there documentation for the bolt_enabled flag anywhere? In particular, I am interested in how this BOLT optimization is done: which profile source is used (external perf profiles or the instrumentation mode, since BOLT supports both of them), which BOLT flags are used during this step, and whether it is possible to specify/override BOLT flags (like it's done in cargo-pgo here).
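For reference, the two profile-collection modes I mean look roughly like this with upstream BOLT. This is a sketch only: the helper names and file names are hypothetical, the optimization flags are a subset of the ones shown in the upstream BOLT documentation, and I don't know which mode or flags the prelude actually uses.

```python
from typing import List


def bolt_optimize_cmd(binary: str, profile: str, optimized: str) -> List[str]:
    """Final step shared by both modes: rewrite the binary using an .fdata profile."""
    return ["llvm-bolt", binary, "-o", optimized, f"-data={profile}",
            "-reorder-blocks=ext-tsp", "-reorder-functions=hfsort",
            "-split-functions", "-dyno-stats"]


def perf_mode_cmds(binary: str, workload_args: List[str], optimized: str) -> List[List[str]]:
    """Sampling mode: record branch samples with perf, convert them, then optimize."""
    return [
        ["perf", "record", "-e", "cycles:u", "-j", "any,u", "-o", "perf.data",
         "--", binary, *workload_args],
        ["perf2bolt", "-p", "perf.data", "-o", "perf.fdata", binary],
        bolt_optimize_cmd(binary, "perf.fdata", optimized),
    ]


def instrumentation_mode_cmds(binary: str, workload_args: List[str],
                              optimized: str) -> List[List[str]]:
    """Instrumentation mode: BOLT emits an instrumented binary that writes the
    profile itself when run, so perf is not needed."""
    instrumented = binary + ".inst"  # hypothetical naming convention
    return [
        ["llvm-bolt", binary, "-instrument",
         "-instrumentation-file=prof.fdata", "-o", instrumented],
        [instrumented, *workload_args],
        bolt_optimize_cmd(binary, "prof.fdata", optimized),
    ]
```

In both cases the last command is the same; only the way the .fdata profile is produced differs, which is why I'd like to know which of the two the bolt_enabled path relies on.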