Halide icon indicating copy to clipboard operation
Halide copied to clipboard

The cmake build should use precompiled headers for Halide.h if possible

Open abadams opened this issue 1 year ago • 9 comments

@mcourteaux reports this speeds up his builds a lot

abadams avatar Feb 02 '24 21:02 abadams

Okay, lemme have a look at this. On my Ryzen 7 2700X 8-core processor, using Ninja, with 48GB RAM.

Some crappy non-rigorous experimentation
  • no precompiled headers:
    ninja  2894.36s user 360.50s system 1312% cpu 4:07.95 total
    total: 3254s
    
  • STL includes only precompiled header:
    ninja  2787.23s user 354.08s system 1325% cpu 3:57.08 total
    total: 3131s
    

(other process started and is now taking roughly one hyperthread)

  • STL includes only precompiled header also added to tests:
    ninja  2820.56s user 359.06s system 1229% cpu 4:18.58 total`
    total: 3179s
    
  • STL + commonly used Halide includes in precompiled header:
    ninja  2811.51s user 359.64s system 1245% cpu 4:14.56 total
    total: 3170s
    
  • lean STL subset + commonly used Halide includes in precompiled header:
    ninja  2783.97s user 344.10s system 1254% cpu 4:09.27 total
    total: 3127s
    

After some crappy experimentation, let's redo the most promising ones and compare against no new precompiled headers (the tests already have precompiled headers enabled), but all with -j8 for more accurate timing. We will be using the following precompiled header:

#pragma once

// Below is a list of common STL includes in the Halide codebase.
// To find them, you can use:
//
// ag --nofilename --nonumbers --nobreak "#include <" src include | sort | uniq -c | sort -h
// I'm explicitly exluding the test folder, because somebody already set up
// a precompiled header for the test directory.

#include <algorithm>
#include <cstdlib>
#include <functional>
#include <sstream>
#include <cstring>
#include <limits>
#include <set>
#include <stdint.h>
#include <iostream>
#include <cstdint>
#include <utility>
#include <memory>
#include <map>
#include <vector>
#include <string>
  • lean STL subset precompiled header on both Halide/src and Halide/test (-j8):
    ninja -j8  1984.23s user 266.16s system 789% cpu 4:44.88 total
    total: 2250s
    
  • lean STL subset precompiled header on Halide/test (-j8)
    ninja -j8  2025.66s user 269.32s system 791% cpu 4:50.07 total
    total: 2295s
    
  • lean STL subset precompiled header on Halide/src (-j8)
    ninja -j8  1872.60s user 251.90s system 790% cpu 4:28.68 total
    total: 2123s
    
  • no new precompiled header (-j8)
    ninja -j8  1942.76s user 262.02s system 791% cpu 4:38.57 total
    total: 2204s
    

So yeah, this is not really worth it. Most of the build time is spent on the tests either way. I think it does speed up the compilation time of the core Halide.a quite a bit: 45s on that part is significant IMO.

Let's do one more test for the giggles, and run it with mold as the linker.

  • lean STL subset precompiled header on both Halide/src and Halide/test (-j8):

    mold --run ninja -j8  1882.77s user 240.74s system 783% cpu 4:31.06 total
    total: 2123s
    
  • lean STL subset precompiled header on Halide/test (-j8)

    mold --run ninja -j8  1918.88s user 244.15s system 784% cpu 4:35.72 total
    total: 2162s
    
  • lean STL subset precompiled header on Halide/src (-j8)

    mold --run ninja -j8  1773.02s user 228.46s system 784% cpu 4:15.07 total
    total: 2001s
    
  • No new precompiled headers (-j8)

    mold --run ninja -j8  1856.02s user 237.41s system 785% cpu 4:26.64 total
    total: 2093s
    

Quickly updating to the latest version of mold (from 1.7.1 to 2.4.0) and rerunning the best test gives a similar (slightly worse) result:

  • lean STL subset precompiled header on Halide/src (-j8)
    mold --run ninja -j8  1785.08s user 229.87s system 784% cpu 4:16.80 total
    total: 2014s
    

Putting this all in a table:

Halide/src Halide/test mold Time
2204s
:ballot_box_with_check: :ballot_box_with_check: 2250s
:ballot_box_with_check: 2123s
:ballot_box_with_check: 2295s
:ballot_box_with_check: (v1.7.1) 2093s
:ballot_box_with_check: :ballot_box_with_check: :ballot_box_with_check: (v1.7.1) 2123s
:ballot_box_with_check: :ballot_box_with_check: (v1.7.1) 2001s
:ballot_box_with_check: :ballot_box_with_check: (v1.7.1) 2162s
:ballot_box_with_check: :ballot_box_with_check: (v2.4.0) 2014s

So we can conclude that enabling a precompiled header consisiting of a handful of most frequently includeded STL headers for the Halide/src folder, gives a 3.6% speedup. Additionally, using mold gets us a 5.7% speedup on top of that. Combined gives this 9.2% compile time improvement. Not much overall.

@abadams What do you think? Do I make the precompiled header a PR? Speedup for me is small. Is there a clear way to check if it helps on the build bots? Opening a PR will trigger the bots, but do we get to see a time elapsed report in the end?

Using mold is probably not straightforward on the bots.


PS:

@mcourteaux reports this speeds up his builds a lot

This statement was made regarding my own project, not Halide.

mcourteaux avatar Feb 15 '24 15:02 mcourteaux

Made a PR, to check timings on build bots.

mcourteaux avatar Feb 15 '24 16:02 mcourteaux

So we can conclude that enabling a precompiled header consisiting of a handful of most frequently includeded STL headers for the Halide/src folder, gives a 3.6% speedup. Additionally, using mold gets us a 5.7% speedup on top of that. Combined gives this 9.2% compile time improvement. Not much overall.

3.6% seems worth it for a modest change that doesn't require new tooling.

I haven't used mold before but would be reluctant to add a new tool requirement unless the improvement is large. Is it faster than lld?

steven-johnson avatar Feb 15 '24 17:02 steven-johnson

Note to self: based on the numbers at https://github.com/rui314/mold, it would be worthwhile to ensure that we always use lld from our captive LLVM, rather than relying on the system linker, which could be gold, which is muuuuch slower.

steven-johnson avatar Feb 15 '24 17:02 steven-johnson

Note to self: based on the numbers at https://github.com/rui314/mold, it would be worthwhile to ensure that we always use lld from our captive LLVM, rather than relying on the system linker, which could be gold, which is muuuuch slower.

...aaaand it looks like trying to override the linker used in CMake is a recipe for pain, at least prior to version 3.29 (which we can't reasonably require yet) :-/

steven-johnson avatar Feb 15 '24 17:02 steven-johnson

...aaaand it looks like trying to override the linker used in CMake is a recipe for pain, at least prior to version 3.29 (which we can't reasonably require yet) :-/

Using mold is as simple as mold --run <your original build command>. It injects itself and intercepts all invocations of the default linker.

mcourteaux avatar Feb 15 '24 17:02 mcourteaux

That would be simple enough if we got a massive speed improvement, but for 5%, I think we want something that is completely transparent. (Especially for a Linux-only solution like mold.)

steven-johnson avatar Feb 15 '24 17:02 steven-johnson

We wouldn't add mold to the build itself, we'd inject -fuse-ld=mold via CMAKE_{C,CXX}_FLAGS in the buildbot config.

alexreinking avatar Feb 16 '24 23:02 alexreinking

Also, mold is GPLv3 and certain companies won't touch it on account of its licensing, so requiring it isn't an option.

alexreinking avatar Feb 16 '24 23:02 alexreinking

This can be closed, IMO.

mcourteaux avatar May 28 '24 08:05 mcourteaux