Halide The cmake build should use precompiled headers for Halide.h if possible

@mcourteaux reports this speeds up his builds a lot

Feb 02 '24 21:02 abadams

Okay, lemme have a look at this. On my Ryzen 7 2700X 8-core processor, using Ninja, with 48GB RAM.

Some crappy non-rigorous experimentation

no precompiled headers:

ninja  2894.36s user 360.50s system 1312% cpu 4:07.95 total
total: 3254s

STL includes only precompiled header:

ninja  2787.23s user 354.08s system 1325% cpu 3:57.08 total
total: 3131s

(other process started and is now taking roughly one hyperthread)

STL includes only precompiled header also added to tests:

ninja  2820.56s user 359.06s system 1229% cpu 4:18.58 total`
total: 3179s

STL + commonly used Halide includes in precompiled header:

ninja  2811.51s user 359.64s system 1245% cpu 4:14.56 total
total: 3170s

lean STL subset + commonly used Halide includes in precompiled header:

ninja  2783.97s user 344.10s system 1254% cpu 4:09.27 total
total: 3127s

After some crappy experimentation, let's redo the most promising ones and compare against no new precompiled headers (the tests already have precompiled headers enabled), but all with -j8 for more accurate timing. We will be using the following precompiled header:

#pragma once

// Below is a list of common STL includes in the Halide codebase.
// To find them, you can use:
//
// ag --nofilename --nonumbers --nobreak "#include <" src include | sort | uniq -c | sort -h
// I'm explicitly exluding the test folder, because somebody already set up
// a precompiled header for the test directory.

#include <algorithm>
#include <cstdlib>
#include <functional>
#include <sstream>
#include <cstring>
#include <limits>
#include <set>
#include <stdint.h>
#include <iostream>
#include <cstdint>
#include <utility>
#include <memory>
#include <map>
#include <vector>
#include <string>

lean STL subset precompiled header on both Halide/src and Halide/test (-j8):

ninja -j8  1984.23s user 266.16s system 789% cpu 4:44.88 total
total: 2250s

lean STL subset precompiled header on Halide/test (-j8)

ninja -j8  2025.66s user 269.32s system 791% cpu 4:50.07 total
total: 2295s

lean STL subset precompiled header on Halide/src (-j8)

ninja -j8  1872.60s user 251.90s system 790% cpu 4:28.68 total
total: 2123s

no new precompiled header (-j8)

ninja -j8  1942.76s user 262.02s system 791% cpu 4:38.57 total
total: 2204s

So yeah, this is not really worth it. Most of the build time is spent on the tests either way. I think it does speed up the compilation time of the core Halide.a quite a bit: 45s on that part is significant IMO.

Let's do one more test for the giggles, and run it with mold as the linker.

lean STL subset precompiled header on both Halide/src and Halide/test (-j8):

mold --run ninja -j8  1882.77s user 240.74s system 783% cpu 4:31.06 total
total: 2123s

lean STL subset precompiled header on Halide/test (-j8)

mold --run ninja -j8  1918.88s user 244.15s system 784% cpu 4:35.72 total
total: 2162s

lean STL subset precompiled header on Halide/src (-j8)

mold --run ninja -j8  1773.02s user 228.46s system 784% cpu 4:15.07 total
total: 2001s

No new precompiled headers (-j8)

mold --run ninja -j8  1856.02s user 237.41s system 785% cpu 4:26.64 total
total: 2093s

Quickly updating to the latest version of mold (from 1.7.1 to 2.4.0) and rerunning the best test gives a similar (slightly worse) result:

lean STL subset precompiled header on Halide/src (-j8)

mold --run ninja -j8  1785.08s user 229.87s system 784% cpu 4:16.80 total
total: 2014s

Putting this all in a table:

Halide/src	Halide/test	mold	Time
			2204s
:ballot_box_with_check:	:ballot_box_with_check:		2250s
:ballot_box_with_check:			2123s
	:ballot_box_with_check:		2295s
		:ballot_box_with_check: (v1.7.1)	2093s
:ballot_box_with_check:	:ballot_box_with_check:	:ballot_box_with_check: (v1.7.1)	2123s
:ballot_box_with_check:		:ballot_box_with_check: (v1.7.1)	2001s
	:ballot_box_with_check:	:ballot_box_with_check: (v1.7.1)	2162s
:ballot_box_with_check:		:ballot_box_with_check: (v2.4.0)	2014s

So we can conclude that enabling a precompiled header consisiting of a handful of most frequently includeded STL headers for the Halide/src folder, gives a 3.6% speedup. Additionally, using mold gets us a 5.7% speedup on top of that. Combined gives this 9.2% compile time improvement. Not much overall.

@abadams What do you think? Do I make the precompiled header a PR? Speedup for me is small. Is there a clear way to check if it helps on the build bots? Opening a PR will trigger the bots, but do we get to see a time elapsed report in the end?

Using mold is probably not straightforward on the bots.

PS:

@mcourteaux reports this speeds up his builds a lot

This statement was made regarding my own project, not Halide.

Feb 15 '24 15:02 mcourteaux

Made a PR, to check timings on build bots.

Feb 15 '24 16:02 mcourteaux

So we can conclude that enabling a precompiled header consisiting of a handful of most frequently includeded STL headers for the Halide/src folder, gives a 3.6% speedup. Additionally, using mold gets us a 5.7% speedup on top of that. Combined gives this 9.2% compile time improvement. Not much overall.

3.6% seems worth it for a modest change that doesn't require new tooling.

I haven't used mold before but would be reluctant to add a new tool requirement unless the improvement is large. Is it faster than lld?

Feb 15 '24 17:02 steven-johnson

Note to self: based on the numbers at https://github.com/rui314/mold, it would be worthwhile to ensure that we always use lld from our captive LLVM, rather than relying on the system linker, which could be gold, which is muuuuch slower.

Feb 15 '24 17:02 steven-johnson

Note to self: based on the numbers at https://github.com/rui314/mold, it would be worthwhile to ensure that we always use lld from our captive LLVM, rather than relying on the system linker, which could be gold, which is muuuuch slower.

...aaaand it looks like trying to override the linker used in CMake is a recipe for pain, at least prior to version 3.29 (which we can't reasonably require yet) :-/

Feb 15 '24 17:02 steven-johnson

...aaaand it looks like trying to override the linker used in CMake is a recipe for pain, at least prior to version 3.29 (which we can't reasonably require yet) :-/

Using mold is as simple as mold --run <your original build command>. It injects itself and intercepts all invocations of the default linker.

Feb 15 '24 17:02 mcourteaux

That would be simple enough if we got a massive speed improvement, but for 5%, I think we want something that is completely transparent. (Especially for a Linux-only solution like mold.)

Feb 15 '24 17:02 steven-johnson

We wouldn't add mold to the build itself, we'd inject -fuse-ld=mold via CMAKE_{C,CXX}_FLAGS in the buildbot config.

Feb 16 '24 23:02 alexreinking

Also, mold is GPLv3 and certain companies won't touch it on account of its licensing, so requiring it isn't an option.

Feb 16 '24 23:02 alexreinking

This can be closed, IMO.

May 28 '24 08:05 mcourteaux