
Compilation time trace


As promised the other day, I did a quick time trace of compiling PIConGPU (SPEC example):

[image: -ftime-trace overview of the whole compilation]

So of the 55s compile time, less than 5s is spent parsing code and doing the first round of template instantiations, then 10s on instantiating pending templates, and 35s on optimization (which is mostly inlining the many template instantiations).

If we look more closely at the template instantiations:

[image: template instantiation hierarchy for picongpu::PluginController]

This is the instantiation hierarchy for picongpu::PluginController. The selected (light green) block is boost::mpl::copy_if<...>, which in turn triggers an insane number of further template instantiations.

You can try this yourself by compiling with Clang and adding -ftime-trace to CMAKE_CXX_FLAGS.

bernhardmgruber (Sep 21 '21)

@bernhardmgruber Cool! Do you have any suggestions for speeding up compilation based on that data?

PrometheusPi (Sep 27 '21)

With my limited understanding of picongpu, these are the suggestions I can make:

  • Reduce the number of template instantiations. A big portion here is MPL, because it uses pre-C++11 TMP techniques. It should be replaced by a modern TMP library, as you already suggested in #1997 (a type-only sketch follows after this list). Whether that also reduces the number of generated functions the optimizer has to inline is a different question. If a TMP library only computes types, the work stays in the compiler frontend, which I imagine is most of what MPL is doing now.
  • Reduce the number of generated functions. This might be a spin-off from the MPL work, but I don't know. It could be improved by refactoring the code base into less deeply nested functions. E.g. if your class has a function m1 which always calls m2 and m1 is called a lot, the compiler always has to inline both of them instead of just one function, whereas m1 and m2 could perhaps be merged somehow (e.g. with a default parameter). It could also be that there is some compile-time iteration over integers that could be replaced by a #pragma unroll (see the second sketch after this list). It's really hard to say.
  • Parallelize compilation without repeating compilation of the same parts. That is trickier than it sounds. You should aim to split the codebase into more translation units, but pay attention that two TUs don't, e.g., re-instantiate the same chain of templates. This can easily happen when both TUs share some common helper functions. More importantly, two TUs should not generate the same function instantiations again, because then you pay in both TUs and once more for the linker to get rid of the duplicates (see the third sketch after this list) :)
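
As an illustration of the first point, here is a minimal sketch of a type-only computation using boost::mp11 (just one candidate for a modern TMP library; the input list and predicate are made up for illustration). Everything happens in the compiler frontend, and no functions are generated for the optimizer to inline later:

```cpp
#include <boost/mp11/algorithm.hpp>
#include <boost/mp11/list.hpp>
#include <type_traits>

using namespace boost::mp11;

// Hypothetical input list, standing in for the plugin/species type lists in PIConGPU.
using input = mp_list<int, float, int*, double*>;

// mp_copy_if filters the list purely at the type level; the equivalent
// boost::mpl::copy_if drags in iterators, inserters and many more instantiations.
using pointers = mp_copy_if<input, std::is_pointer>;

static_assert(std::is_same_v<pointers, mp_list<int*, double*>>, "");
```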
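For the second point, a toy sketch (m1/m2 are the hypothetical names from the bullet above): merging a forwarding function into its callee via a default argument leaves the inliner with one body per call site instead of two, and a compile-time integer loop can often become a plain loop with an unroll hint:

```cpp
// Before: every call site of m1() forces the optimizer to inline m1 and then m2.
struct Before
{
    void m1() { m2(0); }
    void m2(int offset) { /* ... the actual work ... */ }
};

// After: a single body with a default argument; only one function to inline.
struct After
{
    void m(int offset = 0) { /* ... the actual work ... */ }
};

// Compile-time iteration over integers: N instantiations of forEach to inline.
template<int I, int N, typename F>
void forEach(F f)
{
    if constexpr(I < N)
    {
        f(I);
        forEach<I + 1, N>(f);
    }
}

// Runtime loop with an unroll hint (understood by clang and nvcc):
// a single function, the backend does the unrolling.
template<typename F>
void forEach(int n, F f)
{
#pragma unroll
    for(int i = 0; i < n; ++i)
        f(i);
}
```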
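For the third point, one tool to keep multiple TUs from generating the same instantiations is an explicit instantiation declaration (extern template). A minimal sketch with a hypothetical FieldSolver template:

```cpp
// field_solver.hpp -- a heavy class template included by many TUs.
#pragma once

template<typename T>
struct FieldSolver
{
    void step()
    {
        // ... expensive code that every including TU would otherwise instantiate,
        // optimize, and that the linker would then have to deduplicate ...
    }
};

// Suppress implicit instantiation of the common specializations in all including TUs:
extern template struct FieldSolver<float>;
extern template struct FieldSolver<double>;
```

```cpp
// field_solver.cpp -- the one TU that pays for these instantiations.
#include "field_solver.hpp"

template struct FieldSolver<float>;
template struct FieldSolver<double>;
```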

bernhardmgruber (Sep 28 '21)