Slow compile times on Vortex
Bug Report
@trilinos/<PackageName> Not sure if this is Belos, Tpetra, or KokkosKernels.
Description
I've had multiple Vortex builds hang on TSQR recently... Not sure what is happening, but I can just disable it for now, so it's not a super big deal. @iyamazaki
My module list is:
Currently Loaded Modules:
1) StdEnv (S) 6) bsub-wrapper/1.0 11) lapack/3.8.0-gcc-4.9.3
2) sparc-tools/python/3.7.9 7) sparc-cmake/3.23.2 12) sparc-dev/cuda-10.1.243_gcc-7.3.1_spmpi-rolling
3) sparc-tools/exodus/2021.11.26 8) gcc/7.3.1 13) cmake/3.18.0
4) sparc-tools/tools/main 9) spectrum-mpi/rolling-release 14) git/2.20.0
5) sparc-tools/taos/2020.09.04 10) cuda/10.1.243
Modules loaded via:
source ~/Trilinos/cmake/std/atdm/load-env.sh cuda-release
@jennloe Just to confirm, the build hangs? How long have you waited for TSQR to finish compiling?
@trilinos/belos @trilinos/tpetra @trilinos/kokkos-kernels
@jhux2 Waited a couple of hours this morning; I think it was longer the previous time. But turning it off via
-D TpetraCore_ENABLE_TSQR:BOOL=OFF \
seems to get me past it. Could also be a Vortex issue; I haven't needed to disable that explicitly until after recent updates to the system. That is, I'm seeing this in multiple versions of Trilinos (13.2 and current develop) and hitting problems with it on Vortex for the first time.
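For reference, a minimal configure fragment with that option in context might look like this (a sketch, not the actual script; ${TRILINOS_SRC} is a placeholder for the source tree):
# sketch: disable TSQR at configure time
cmake \
  -D Trilinos_ENABLE_Belos:BOOL=ON \
  -D TpetraCore_ENABLE_TSQR:BOOL=OFF \
  ${TRILINOS_SRC}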
This seems to be related to a general build-time slowdown on Vortex since the recent upgrades. My full CMake config script is attached: loadTrilinosGPU.txt, cmake_config_cuda.txt
@jennloe Are you building the Serial (?), OpenMP, and CUDA backends, with double+complex?
-D Tpetra_ENABLE_CUDA:BOOL=ON \
-D Kokkos_ENABLE_CUDA:BOOL=ON \
-D Kokkos_ENABLE_CUDA_LAMBDA:BOOL=ON \
-D Tpetra_INST_CUDA:BOOL=ON \
-D TPL_ENABLE_CUDA:BOOL=ON \
From the scripts posted above, it's a CUDA build.
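If there's any doubt about which backends actually got enabled, the configure results can be checked in the cache (assuming a standard CMake build directory):
# run from the build directory; shows what configure actually recorded
grep -E 'Kokkos_ENABLE_(SERIAL|OPENMP|CUDA):' CMakeCache.txt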
@cgcgcg I've done that before... without Trilinos needing 14 hours to build. This is beyond the TSQR problem now.
Not intentionally building serial. Thought I took that out.
Well, I'm not sure that you are building serial; I don't know what the logic is. Is this a Trilinos problem or a Vortex problem? Did you try building an older Trilinos to test?
So actually, I'm building both current Trilinos develop and Trilinos 13.2 from October. And both are having this problem. @jhux2 and @csiefer2 are talking with SEMS. Framework is seeing a 33% slowdown in build times on their end. It's a bigger conversation now.
@jennloe btw, you don't need bsub-wrapper/1.0; in fact, it's going to cause problems.
@jhux2 Then someone needs to adjust the atdm build script...
But I'll manually unload it for now.
I only found that out recently, when it caused me some problems.
First observation -- cmake performance is terrible, apparently from the test compiles done during configure.
1 minute, 20 seconds to compile Teuchos_Language.cpp.o
Alright, I did just a stupid main.cpp:
#include <iostream>

int main(int argc, char *argv[]) {
  int three = 3;
  std::cout << "three=" << three << std::endl;
  return 0;
}
(~/tmp) time mpicxx main.cpp
real 0m34.623s
user 0m1.426s
sys 0m2.012s
(~/tmp) printenv OMPI_CXX
OMPI_CXX=/home/jhu/trilinos/Trilinos-trilinos/packages/kokkos/bin/nvcc_wrapper
(~/tmp) unset OMPI_CXX
(~/tmp) time mpicxx main.cpp
real 0m5.930s
user 0m0.266s
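For context: Open MPI-style wrappers (which Spectrum MPI is derived from) consult OMPI_CXX to pick the underlying C++ compiler, which is why unsetting it takes nvcc_wrapper out of the picture. If the wrapper supports it, --showme prints the underlying command without compiling anything:
# print the command mpicxx would actually run (Open MPI-style wrappers)
mpicxx --showme main.cpp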
> First observation -- cmake performance is terrible, apparently from the test compiles done during configure.
Yep. Noticing that, too. I know Eclipse is CPU-only, but I've run similar builds on Eclipse and everything was SO much faster.
nvcc_wrapper does about 6 compiles for this file, each of which is >= 3 seconds.
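To see the individual sub-steps nvcc runs for a file (cudafe++, cicc, ptxas, fatbinary, the host compile), it can list them without executing anything:
# prints each command nvcc would run, one per line, without executing
nvcc -x cu --dryrun main.cpp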
Per @crtrott, this is most likely a filesystem issue. As evidence, I copied everything to /tmp on the compute node where I'm compiling. That reduced the nvcc_wrapper (really multiple nvcc/g++) time to 12.7s.
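A cheaper variant of that test, assuming nvcc honors TMPDIR on Linux for its intermediate files (nvcc_wrapper also has NVCC_WRAPPER_TMPDIR, used below): point just the compiler temp files at node-local disk instead of copying the whole tree.
# hypothesis check: put compiler intermediates on node-local disk
mkdir -p /tmp/$USER
export TMPDIR=/tmp/$USER
time nvcc -x cu main.cpp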
Time on kokkos-dev-2:
[crtrott@kokkos-dev-2 compiletime]$ time mpicxx main.cpp
real 0m0.803s
user 0m0.214s
sys 0m0.188s
[crtrott@kokkos-dev-2 compiletime]$ time nvcc -x cu main.cpp
real 0m1.148s
user 0m0.804s
sys 0m0.337s
[crtrott@kokkos-dev-2 compiletime]$ time ~/Kokkos//kokkos/bin/nvcc_wrapper main.cpp
real 0m1.164s
user 0m0.786s
sys 0m0.363s
[crtrott@kokkos-dev-2 compiletime]$ export OMPI_CXX=/home/crtrott/Kokkos//kokkos/bin/nvcc_wrapper
[crtrott@kokkos-dev-2 compiletime]$ time mpicxx main.cpp
real 0m1.232s
user 0m0.798s
sys 0m0.383s
[crtrott@kokkos-dev-2 compiletime]$
On vortex, on a compute node, in /tmp/jhu-scratch:
(/tmp/jhu-scratch) time mpicxx main.cpp
real 0m5.910s
user 0m0.256s
sys 0m0.422s
(/tmp/jhu-scratch) time nvcc -x cu main.cpp
real 0m17.676s
user 0m1.301s
sys 0m1.120s
(/tmp/jhu-scratch) export NVCC_WRAPPER_TMPDIR=/tmp/jhu-scratch
(/tmp/jhu-scratch) time ./nvcc_wrapper main.cpp
real 0m18.044s
user 0m1.248s
sys 0m1.213s
(/tmp/jhu-scratch) export OMPI_CXX=/tmp/jhu-scratch/nvcc_wrapper
(/tmp/jhu-scratch) time mpicxx main.cpp
real 0m20.761s
user 0m1.298s
sys 0m1.492s
Vortex admins are looking into this. Results of the same tests on lassen:
(~/tmp) time mpicxx main.cpp
real 0m0.389s
user 0m0.243s
sys 0m0.093s
(~/tmp) time nvcc -x cu main.cpp
real 0m1.595s
user 0m1.282s
sys 0m0.244s
(~/tmp) time ./nvcc_wrapper main.cpp
real 0m1.603s
user 0m1.262s
sys 0m0.266s
(~/tmp) export OMPI_CXX=~/tmp/nvcc_wrapper
(~/tmp) time mpicxx main.cpp
real 0m0.345s
user 0m0.277s
sys 0m0.046s
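Pulling the real times from the three machines together:
                 mpicxx    nvcc -x cu   nvcc_wrapper   mpicxx (OMPI_CXX=nvcc_wrapper)
kokkos-dev-2     0.803s    1.148s       1.164s         1.232s
vortex (/tmp)    5.910s    17.676s      18.044s        20.761s
lassen           0.389s    1.595s       1.603s         0.345s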
Wow. That's... a serious difference.
Something is off with John's TCE wrapper. (Edit: even calling the real nvcc, it's still ~3x slower than kokkos-dev.)
# this is "nvcc" - but it's really just a wrapper by tce
[jjellio@vortex1 ~]$ time nvcc -x cu dummy.cpp
real 0m22.238s
user 0m1.402s
sys 0m1.290s
# this is the real nvcc
[jjellio@vortex1 ~]$ time /usr/tce/packages/cuda/cuda-10.1.243/nvidia/bin/nvcc -x cu dummy.cpp
real 0m3.815s
user 0m1.289s
sys 0m0.740s
# try it again
[jjellio@vortex1 ~]$ time nvcc -x cu dummy.cpp
real 0m20.189s
user 0m1.498s
sys 0m1.179s
# try it again ...
[jjellio@vortex1 ~]$ time /usr/tce/packages/cuda/cuda-10.1.243/nvidia/bin/nvcc -x cu dummy.cpp
real 0m3.740s
user 0m1.534s
sys 0m0.498s
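A quick way to tell whether the nvcc on PATH is the real binary or a site shim:
# a script shim shows up as a shell script; the real nvcc is an ELF binary
which nvcc
file $(which nvcc)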
I believe this has been resolved, so I'll close. If this is still an issue, reopen(!) and let's dive into it again with Kevin B.