Build performance does not scale to many cores/threads
Explain what you would like to see improved
I know that this is very much a first world problem, but it has been bugging me for a while. The build of ROOT using its CMake setup does not scale well at all to many-core systems. :frowning:
This is a snapshot of how ROOT 6.20/08 used my system's resources during its build:
The build starts "pretty much" at the left hand side of the timeline, and lasts until "pretty much" the right hand side of it.
As you can see, the build starts out very well. Building LLVM scales perfectly to 64 threads, and I believe it would scale well even beyond that. But once the LLVM build is done, many bottlenecks show up. First there is a big bottleneck around building `libCling` and `rootcling`, and after that the build of `libRIO` also takes a surprising amount of time. The rest of the build is stuck waiting on all of these.
Towards the end things improve a bit, as many libraries / source files can once again build in parallel. But even then the build very rarely manages to make use of all of the available cores.
Optional: share how it could be improved
From a quick glance it seems that ROOT's CMake configuration sets up far too many unnecessary dependencies between its build targets. As far as I can see, most of the issues arise from how the dictionary generation is set up.
In ATLAS I use the following code to set up the generation of dictionary source files:
https://gitlab.cern.ch/atlas/atlasexternals/-/blob/master/Build/AtlasCMake/modules/AtlasDictionaryFunctions.cmake
And that behaves much better, mainly because in ATLAS's setup dictionary generation does not need to wait for anything. Even if the library that a dictionary is being produced for depends on a number of upstream libraries, the dictionary for that library can be generated before all of the upstream libraries have finished building. In practice this means that the start of any ATLAS software build is dominated by running dictionary generation, since GNU Make and Ninja both prefer running those build steps first, as they have no dependencies of their own.
The reason I blame the dictionary generation code is that regular C(++) code builds with Ninja scale very well to many cores. Even when one has many small libraries in a project, Ninja can start building object files before all of the libraries that they depend on have finished building. (In ATLAS's offline software the very end of a build is taken up purely by library/executable linking steps.)
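To illustrate that scheduling behaviour, here is a minimal CMake sketch, not the actual ATLAS or ROOT code; the `Foo` / `UpstreamLib` names and file names are invented. The dictionary rule depends only on plain files that exist from the start of the build, so Ninja is free to run rootcling immediately:

```cmake
# Sketch: the dictionary source is generated by a rule whose only dependencies
# are files that exist from the very start of the build, so the build tool can
# schedule rootcling before anything else has been compiled or linked.
add_custom_command(
  OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/FooDict.cxx
  COMMAND rootcling -f ${CMAKE_CURRENT_BINARY_DIR}/FooDict.cxx
          -I${CMAKE_CURRENT_SOURCE_DIR}
          ${CMAKE_CURRENT_SOURCE_DIR}/Foo.h
          ${CMAKE_CURRENT_SOURCE_DIR}/LinkDef.h
  DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/Foo.h ${CMAKE_CURRENT_SOURCE_DIR}/LinkDef.h
  COMMENT "Generating dictionary source for Foo")

# The library is built and linked as usual; only the compilation of
# FooDict.cxx waits for the generated source, not the other way around.
add_library(Foo SHARED Foo.cxx ${CMAKE_CURRENT_BINARY_DIR}/FooDict.cxx)
target_link_libraries(Foo PUBLIC UpstreamLib)
```

With rules like this, Ninja sees the rootcling invocations as having no unbuilt prerequisites, which is why in the ATLAS build they all run right at the start.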
To Reproduce
Unfortunately you need a pretty powerful machine to do so... But once you do, just do something similar to what I did:
```
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_STANDARD=17 \
   -Dall=ON -Dbuiltin_gsl=ON -Dbuiltin_freetype=ON -Dbuiltin_lzma=ON -Dbuiltin_veccore=ON \
   -DXROOTD_ROOT_DIR=~/software/xrootd/4.12.2/x86_64-ubuntu2004-gcc9-opt \
   -DTBB_ROOT_DIR=~/software/oneTBB/2020.2/x86_64-ubuntu2004-gcc9-opt \
   -DCMAKE_INSTALL_PREFIX=~/software/root/6.20.08/x86_64-ubuntu2004-gcc9-opt ../root-6.20.08/
ninja
```
Setup
As mentioned earlier, I used ROOT 6.20/08 for this particular test. But the behaviour has been like this since forever. I performed the build on Ubuntu 20.04 with GCC 9, but that should make little difference to the overall behaviour.
Additional context
N/A
I'm aware of this. This is mostly caused by dictionary dependencies. I have a prototype that fixes this; I need to invest some dev time to get it into PR quality. I.e. thanks for the report, problem acknowledged!
What can be done here is rather simple. The bottleneck last time I checked is rootcling (dictionary generation). There are two reasons:
- cmake -- dictionary generation depends on the LinkDef and header files, and both artifacts are available from the beginning. However, the CMake build system does not have separate targets for dictionary generation and library generation, which forces rootcling to wait for the expensive linking step of each library. For example, instead of building `Y.pcm` once we are done with building `X.pcm`, we wait for the linker to link `X.so`. A sketch of how the two kinds of targets could be decoupled follows this list.
- rootcling is unnecessarily slow -- the tool has grown organically, and in many cases we make many iterations over the AST where we don't need them. Some of the larger-scale ideas have been outlined here for years: https://github.com/root-project/root-evolution/pull/5 In fact, we don't need to get that deep into refactoring rootcling; attaching a profiler and looking at the bottlenecks should be easy. For example, iirc, we make several passes over the AST to harvest the selection rules instead of making a single pass.
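As a rough sketch of that first point (the target names, file names and exact rootcling flags below are only illustrative, not the actual ROOT CMake code), the dictionary of a downstream library can be made to depend on the upstream dictionary rule instead of the upstream library target, so that `Y.pcm` waits only for `X.pcm` and never for the link of `X.so`:

```cmake
# Upstream library X: the dictionary gets its own target, separate from the library.
add_custom_command(
  OUTPUT XDict.cxx X.pcm
  COMMAND rootcling -f XDict.cxx -s libX X.h XLinkDef.h
  DEPENDS X.h XLinkDef.h
  COMMENT "Generating dictionary for X")
add_custom_target(X-dict DEPENDS XDict.cxx X.pcm)
add_library(X SHARED X.cxx XDict.cxx)

# Downstream library Y: its dictionary needs X's headers and X.pcm, so it
# depends on the dictionary target X-dict rather than on the library target X.
# The build tool can therefore run this rootcling invocation as soon as X.pcm
# exists, without waiting for the expensive link of X.so.
add_custom_command(
  OUTPUT YDict.cxx Y.pcm
  COMMAND rootcling -f YDict.cxx -s libY -m X.pcm Y.h YLinkDef.h
  DEPENDS Y.h YLinkDef.h X-dict
  COMMENT "Generating dictionary for Y")
add_library(Y SHARED Y.cxx YDict.cxx)
target_link_libraries(Y PUBLIC X)  # linking Y.so still waits for X.so, as it must
```

The essential change is that the only target-level edge into a dictionary rule is another dictionary rule, which gives exactly the "`Y.pcm` right after `X.pcm`" ordering described above.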
> This is mostly caused by dictionary dependencies. I have a prototype that fixes this; I need to invest some dev time to get this into PR quality.
Moved on, giving up on this - here's what I ended up with last time I looked at it. I added some comments to explain what's happening.
(It also fixes the "changed a header included by a header that's passed to rootcling" transitive dependency issue...)
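For reference, one generic way to express such transitive header dependencies in plain CMake (this is only a sketch, not necessarily what the prototype above does; IMPLICIT_DEPENDS is honoured by the Makefile generators only, while Ninja would need a DEPFILE produced by the command itself):

```cmake
# Sketch: re-run dictionary generation when a header included *by* Foo.h
# changes, not only when Foo.h or LinkDef.h themselves change.
add_custom_command(
  OUTPUT FooDict.cxx
  COMMAND rootcling -f FooDict.cxx
          ${CMAKE_CURRENT_SOURCE_DIR}/Foo.h
          ${CMAKE_CURRENT_SOURCE_DIR}/LinkDef.h
  DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/Foo.h ${CMAKE_CURRENT_SOURCE_DIR}/LinkDef.h
  IMPLICIT_DEPENDS CXX ${CMAKE_CURRENT_SOURCE_DIR}/Foo.h
  COMMENT "Generating dictionary for Foo with transitive header tracking")
```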
I personally do not think that the runtime of rootcling is the problem here, but rather the dependency tree. Of course making anything faster is good.