Nalu-wind build times on ORNL Summit for GPU builds
I am creating this issue to track nalu-wind build times on ORNL Summit GPU builds. Currently, Summit login nodes will kill a nalu-wind build job if it uses more than 8 parallel jobs (i.e., -j 8 or lower will succeed, but -j 16 or such will die with
nvcc error: 'cicc' killed due to signal 9
For comparison, this doesn't happen when building trilinos where I am able to use a default of -j 16.
A fresh build of nalu-wind on summit takes approx ~45 mins for 546 files when restricted to 8 parallel jobs. Using the .ninja_log here is a summary of the timings
Nalu-Wind build stats
count 553.000000
mean 36.128407
std 19.836133
min 0.030000
25% 24.243000
50% 37.031000
75% 46.994000
max 118.994000
And for comparison, here are the stats for Trilinos build with the same set of modules and configuration:
Trilinos build stats
count 2706.000000
mean 7.663079
std 11.996461
min 0.018000
25% 0.274250
50% 2.564000
75% 11.111500
max 131.237000

@sayerhs That's great, I'm glad you were able to get the per-target build times. It looks like the ngp_algorithms files are the worst offenders. I think any compilation-unit with cuda code has to be compiled twice, and then templating probably compounds that. I wonder if we have any ngp kernels in headers that can be moved to .C files, or if we have too much code in headers in general, and could move more to .C files. Or, perhaps we have some .C files that are including un-needed headers...
@sayerhs One more thing: the stk ngp stuff could be a culprit, it is all header-only, and perhaps much of it doesn't need to be header-only. I'll look into that.
@sayerhs Just from browsing the code, I wonder if we could reduce build times by splitting NgpLoopUtils.h into several separate headers. For instance I see that many .C files call run_entity_algorithm but not run_entity_par_reduce or vice-versa, and same with the run_face_elem_ variants. Also, some .C files call a run_entity_ variant but not run_face_elem_ and vice-versa. So if we split up NgpLoopUtils.h so that each .C file only includes what it needs, that might help.
@alanw0 I did some experiments on build times for src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C. Just compiling that file alone again and again gives approximately ~122s for the build in .ninja_log.
Next, I removed all the templated functions in include/ngp_utils/NgpLoopUtils.h except run_face_elem_algorithm (and things called by it). This still gave build time for this file around ~120s.
Then, after resetting back to master, I commented out this line https://github.com/Exawind/nalu-wind/blob/78ebed21e86e90c039d4765a9c325fe049d890eb/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C#L159 and attempted a build again. That gave me a build time of ~40.8s.
Applying this following patch (on a clean master) and building took ~70s.
diff --git a/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C b/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
index f8ae4e3..98ae4c4 100644
--- a/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
+++ b/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
@@ -156,7 +156,8 @@ NodalGradPOpenBoundary<AlgTraits>::execute()
);
}
-INSTANTIATE_KERNEL_FACE_ELEMENT(NodalGradPOpenBoundary)
+//INSTANTIATE_KERNEL_FACE_ELEMENT(NodalGradPOpenBoundary)
+template class NodalGradPOpenBoundary<AlgTraitsQuad4Hex8>;
} // namespace nalu
} // namespace Sierra
The following patch took ~105.17s
diff --git a/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C b/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
index f8ae4e3..aca67fa 100644
--- a/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
+++ b/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C
@@ -156,7 +156,7 @@ NodalGradPOpenBoundary<AlgTraits>::execute()
);
}
-INSTANTIATE_KERNEL_FACE_ELEMENT(NodalGradPOpenBoundary)
+INSTANTIATE_KERNEL_FACE_ELEMENT_3D(NodalGradPOpenBoundary)
} // namespace nalu
} // namespace Sierra
@sayerhs that's interesting. In general, are we instantiating more than we need to? For debugging a particular case, can you still link if you only instantiate for the element type that you know you will need?
@alanw0 Since we use stk::mesh::Part::topology() at runtime to determine which exact instantiation we will use, we will need all the possible combinations instantiated. See the following functions for example
https://github.com/Exawind/nalu-wind/blob/78ebed21e86e90c039d4765a9c325fe049d890eb/include/ngp_utils/NgpCreateElemInstance.h#L121-L125
https://github.com/Exawind/nalu-wind/blob/78ebed21e86e90c039d4765a9c325fe049d890eb/include/kernel/KernelBuilder.h#L168-L169
What is curious to me is when I commented out the line https://github.com/Exawind/nalu-wind/blob/78ebed21e86e90c039d4765a9c325fe049d890eb/src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C#L159 it still took ~40s to compile that file. That seems a bit high. So I am wondering if we are pulling in way too many headers than we need to.
Another thing to do would be to follow the Tpetra idea of Alg_{def,decl,fwd}.hpp and then use separate files for ETI. This will allow ninja to do more parallel builds and will prevent summit from killing cicc because it is consuming too much memory per file.
@sayerhs a quick question regarding the build time, can you see if there are weak symbols in the library and if so how many? These indicate that some function or class is being recompiled in each file even if it already exists with the exact same template parameters somewhere else. We had that issue in MueLu, fixing it reduced both build time and size.
@lucbv that's a good suggestion. I'll take a look at it.
Closing for now due to lack of updates. We can re-open if this is still a problem.