The cmake build should use precompiled headers for Halide.h if possible
@mcourteaux reports this speeds up his builds a lot
Okay, lemme have a look at this. On my Ryzen 7 2700X 8-core processor, using Ninja, with 48GB RAM.
Some crappy non-rigorous experimentation
- no precompiled headers:
ninja 2894.36s user 360.50s system 1312% cpu 4:07.95 total total: 3254s - STL includes only precompiled header:
ninja 2787.23s user 354.08s system 1325% cpu 3:57.08 total total: 3131s
(other process started and is now taking roughly one hyperthread)
- STL includes only precompiled header also added to tests:
ninja 2820.56s user 359.06s system 1229% cpu 4:18.58 total` total: 3179s - STL + commonly used Halide includes in precompiled header:
ninja 2811.51s user 359.64s system 1245% cpu 4:14.56 total total: 3170s - lean STL subset + commonly used Halide includes in precompiled header:
ninja 2783.97s user 344.10s system 1254% cpu 4:09.27 total total: 3127s
After some crappy experimentation, let's redo the most promising ones and compare against no new precompiled headers (the tests already have precompiled headers enabled), but all with -j8 for more accurate timing. We will be using the following precompiled header:
#pragma once
// Below is a list of common STL includes in the Halide codebase.
// To find them, you can use:
//
// ag --nofilename --nonumbers --nobreak "#include <" src include | sort | uniq -c | sort -h
// I'm explicitly exluding the test folder, because somebody already set up
// a precompiled header for the test directory.
#include <algorithm>
#include <cstdlib>
#include <functional>
#include <sstream>
#include <cstring>
#include <limits>
#include <set>
#include <stdint.h>
#include <iostream>
#include <cstdint>
#include <utility>
#include <memory>
#include <map>
#include <vector>
#include <string>
- lean STL subset precompiled header on both Halide/src and Halide/test (
-j8):ninja -j8 1984.23s user 266.16s system 789% cpu 4:44.88 total total: 2250s - lean STL subset precompiled header on Halide/test (
-j8)ninja -j8 2025.66s user 269.32s system 791% cpu 4:50.07 total total: 2295s - lean STL subset precompiled header on Halide/src (
-j8)ninja -j8 1872.60s user 251.90s system 790% cpu 4:28.68 total total: 2123s - no new precompiled header (
-j8)ninja -j8 1942.76s user 262.02s system 791% cpu 4:38.57 total total: 2204s
So yeah, this is not really worth it. Most of the build time is spent on the tests either way. I think it does speed up the compilation time of the core Halide.a quite a bit: 45s on that part is significant IMO.
Let's do one more test for the giggles, and run it with mold as the linker.
-
lean STL subset precompiled header on both Halide/src and Halide/test (
-j8):mold --run ninja -j8 1882.77s user 240.74s system 783% cpu 4:31.06 total total: 2123s -
lean STL subset precompiled header on Halide/test (
-j8)mold --run ninja -j8 1918.88s user 244.15s system 784% cpu 4:35.72 total total: 2162s -
lean STL subset precompiled header on Halide/src (
-j8)mold --run ninja -j8 1773.02s user 228.46s system 784% cpu 4:15.07 total total: 2001s -
No new precompiled headers (
-j8)mold --run ninja -j8 1856.02s user 237.41s system 785% cpu 4:26.64 total total: 2093s
Quickly updating to the latest version of mold (from 1.7.1 to 2.4.0) and rerunning the best test gives a similar (slightly worse) result:
- lean STL subset precompiled header on Halide/src (
-j8)mold --run ninja -j8 1785.08s user 229.87s system 784% cpu 4:16.80 total total: 2014s
Putting this all in a table:
| Halide/src | Halide/test | mold | Time |
|---|---|---|---|
| 2204s | |||
| :ballot_box_with_check: | :ballot_box_with_check: | 2250s | |
| :ballot_box_with_check: | 2123s | ||
| :ballot_box_with_check: | 2295s | ||
| :ballot_box_with_check: (v1.7.1) | 2093s | ||
| :ballot_box_with_check: | :ballot_box_with_check: | :ballot_box_with_check: (v1.7.1) | 2123s |
| :ballot_box_with_check: | :ballot_box_with_check: (v1.7.1) | 2001s | |
| :ballot_box_with_check: | :ballot_box_with_check: (v1.7.1) | 2162s | |
| :ballot_box_with_check: | :ballot_box_with_check: (v2.4.0) | 2014s |
So we can conclude that enabling a precompiled header consisiting of a handful of most frequently includeded STL headers for the Halide/src folder, gives a 3.6% speedup. Additionally, using mold gets us a 5.7% speedup on top of that. Combined gives this 9.2% compile time improvement. Not much overall.
@abadams What do you think? Do I make the precompiled header a PR? Speedup for me is small. Is there a clear way to check if it helps on the build bots? Opening a PR will trigger the bots, but do we get to see a time elapsed report in the end?
Using mold is probably not straightforward on the bots.
PS:
@mcourteaux reports this speeds up his builds a lot
This statement was made regarding my own project, not Halide.
Made a PR, to check timings on build bots.
So we can conclude that enabling a precompiled header consisiting of a handful of most frequently includeded STL headers for the Halide/src folder, gives a 3.6% speedup. Additionally, using mold gets us a 5.7% speedup on top of that. Combined gives this 9.2% compile time improvement. Not much overall.
3.6% seems worth it for a modest change that doesn't require new tooling.
I haven't used mold before but would be reluctant to add a new tool requirement unless the improvement is large. Is it faster than lld?
Note to self: based on the numbers at https://github.com/rui314/mold, it would be worthwhile to ensure that we always use lld from our captive LLVM, rather than relying on the system linker, which could be gold, which is muuuuch slower.
Note to self: based on the numbers at
https://github.com/rui314/mold, it would be worthwhile to ensure that we always uselldfrom our captive LLVM, rather than relying on the system linker, which could begold, which is muuuuch slower.
...aaaand it looks like trying to override the linker used in CMake is a recipe for pain, at least prior to version 3.29 (which we can't reasonably require yet) :-/
...aaaand it looks like trying to override the linker used in CMake is a recipe for pain, at least prior to version 3.29 (which we can't reasonably require yet) :-/
Using mold is as simple as mold --run <your original build command>. It injects itself and intercepts all invocations of the default linker.
That would be simple enough if we got a massive speed improvement, but for 5%, I think we want something that is completely transparent. (Especially for a Linux-only solution like mold.)
We wouldn't add mold to the build itself, we'd inject -fuse-ld=mold via CMAKE_{C,CXX}_FLAGS in the buildbot config.
Also, mold is GPLv3 and certain companies won't touch it on account of its licensing, so requiring it isn't an option.
This can be closed, IMO.