
Slow compile times on vortex

Open jennloe opened this issue 2 years ago • 26 comments

Bug Report

@trilinos/<PackageName> Not sure if this is Belos, Tpetra, or KokkosKernels.

Description

I've had multiple Vortex builds hanging on TSQR recently... Not sure what is happening. But I can just disable it for now, so not a super big deal. @iyamazaki

My module list is:


Currently Loaded Modules:
  1) StdEnv                        (S)   6) bsub-wrapper/1.0              11) lapack/3.8.0-gcc-4.9.3
  2) sparc-tools/python/3.7.9            7) sparc-cmake/3.23.2            12) sparc-dev/cuda-10.1.243_gcc-7.3.1_spmpi-rolling
  3) sparc-tools/exodus/2021.11.26       8) gcc/7.3.1                     13) cmake/3.18.0
  4) sparc-tools/tools/main              9) spectrum-mpi/rolling-release  14) git/2.20.0
  5) sparc-tools/taos/2020.09.04        10) cuda/10.1.243


jennloe avatar Aug 01 '22 17:08 jennloe

Modules loaded via: source ~/Trilinos/cmake/std/atdm/load-env.sh cuda-release

jennloe avatar Aug 01 '22 17:08 jennloe

@jennloe Just to confirm, the build hangs? How long have you waited for TSQR to finish compiling?

jhux2 avatar Aug 01 '22 17:08 jhux2

@trilinos/belos @trilinos/tpetra @trilinos/kokkos-kernels

jhux2 avatar Aug 01 '22 18:08 jhux2

@jhux2 Waited a couple of hours this morning. Think it was longer the previous time. But turning it off via -D TpetraCore_ENABLE_TSQR:BOOL=OFF seems to get me past it. Could also be a Vortex issue; I hadn't needed to disable it explicitly until after the recent updates to the system. That is, I'm seeing this in multiple versions of Trilinos (13.2 and current develop), and I'm having problems with it on Vortex for the first time.

jennloe avatar Aug 01 '22 19:08 jennloe
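For anyone hitting the same hang, a minimal sketch of the workaround as a configure fragment, assuming a standard out-of-source build (the source path is a placeholder, and any other options are whatever the existing configure script already passes):

#!/bin/bash
# Sketch: reconfigure with TSQR disabled.
TRILINOS_SRC=${HOME}/Trilinos   # assumption: adjust to your checkout

cmake \
  -D TpetraCore_ENABLE_TSQR:BOOL=OFF \
  "${TRILINOS_SRC}"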

Could this be related to a general build-time slowdown on Vortex since the recent upgrades? My full CMake config script is attached: loadTrilinosGPU.txt, cmake_config_cuda.txt

jennloe avatar Aug 02 '22 21:08 jennloe

@jennloe Are you building the Serial (?), OpenMP, and CUDA backends, with double+complex?

cgcgcg avatar Aug 02 '22 21:08 cgcgcg

-D Tpetra_ENABLE_CUDA:BOOL=ON \
-D Kokkos_ENABLE_CUDA:BOOL=ON \
-D Kokkos_ENABLE_CUDA_LAMBDA:BOOL=ON \
-D Tpetra_INST_CUDA:BOOL=ON \
-D TPL_ENABLE_CUDA:BOOL=ON \

From the scripts posted above, it's a CUDA build.

lucbv avatar Aug 02 '22 22:08 lucbv

@cgcgcg I've done that before... without Trilinos needing 14 hours to build. This is beyond the TSQR problem now.

jennloe avatar Aug 02 '22 22:08 jennloe

Not intentionally building serial. Thought I took that out.

jennloe avatar Aug 02 '22 22:08 jennloe

Well, I'm not sure that you are building serial; I don't know what the logic is. Is this a Trilinos problem or a Vortex problem? Did you try building an older Trilinos to test?

cgcgcg avatar Aug 02 '22 22:08 cgcgcg
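One way to answer the "am I building serial?" question, assuming the configure has completed, is to grep the generated cache; the Tpetra_INST_* and Kokkos_ENABLE_* entries record what was actually enabled:

# Run from the build directory after configure finishes.
grep -E 'Tpetra_INST_(SERIAL|OPENMP|CUDA)' CMakeCache.txt
grep -E 'Kokkos_ENABLE_(SERIAL|OPENMP|CUDA):' CMakeCache.txt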

So actually, I'm building both current Trilinos develop and Trilinos 13.2 from October. And both are having this problem. @jhux2 and @csiefer2 are talking with SEMS. Framework is seeing a 33% slowdown in build times on their end. It's a bigger conversation now.

jennloe avatar Aug 02 '22 22:08 jennloe

@jennloe btw, you don't need bsub-wrapper 1.0; in fact, it's going to cause problems.

jhux2 avatar Aug 02 '22 22:08 jhux2

@jhux2 Then someone needs to adjust the atdm build script...

jennloe avatar Aug 02 '22 22:08 jennloe

But I'll manually unload it for now.

jennloe avatar Aug 02 '22 22:08 jennloe

I only found that out recently when it caused me some problems

jhux2 avatar Aug 02 '22 22:08 jhux2

First observation: CMake performance is terrible, apparently because of the test compiles done during configure.

jhux2 avatar Aug 02 '22 22:08 jhux2

1 minute, 20 seconds to compile Teuchos_Language.cpp.o.

jhux2 avatar Aug 02 '22 23:08 jhux2

Alright, I tried just a trivial main.cpp:

#include <iostream>

int main(int argc, char *argv[]) {
  int three = 3;
  std::cout << "three=" << three << std::endl;
  return 0;
}
(~/tmp) time mpicxx main.cpp

real	0m34.623s
user	0m1.426s
sys	0m2.012s
(~/tmp) printenv OMPI_CXX
OMPI_CXX=/home/jhu/trilinos/Trilinos-trilinos/packages/kokkos/bin/nvcc_wrapper
(~/tmp) unset OMPI_CXX
(~/tmp) time mpicxx main.cpp

real	0m5.930s
user	0m0.266s

jhux2 avatar Aug 02 '22 23:08 jhux2
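A single time run is noisy on a shared login node; here is a small sketch of a repeatable version of the same measurement, averaging several compiles with and without the nvcc_wrapper indirection (the run count and the wrapper path are assumptions):

#!/bin/bash
# Compile the same trivial file N times and report the average wall time.
N=5
SRC=main.cpp

avg_time() {
  local start end total=0
  for _ in $(seq "$N"); do
    start=$(date +%s.%N)
    "$@" "$SRC" -o /dev/null
    end=$(date +%s.%N)
    total=$(echo "$total + $end - $start" | bc)
  done
  echo "avg over $N runs: $(echo "scale=2; $total / $N" | bc)s  ($*)"
}

unset OMPI_CXX
avg_time mpicxx                                                  # plain mpicxx
export OMPI_CXX=${HOME}/Trilinos/packages/kokkos/bin/nvcc_wrapper  # assumption: adjust path
avg_time mpicxx                                                  # mpicxx through nvcc_wrapper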

First observation: CMake performance is terrible, apparently because of the test compiles done during configure.

Yep. Noticing that, too. I know Eclipse is CPU-only, but I've run similar builds on Eclipse and everything was SO much faster.

jennloe avatar Aug 03 '22 00:08 jennloe

nvcc_wrapper does about 6 sub-compiles for this file, each of which takes >= 3 seconds.

jhux2 avatar Aug 03 '22 00:08 jhux2
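To see what those sub-compiles are, nvcc itself can enumerate its internal steps; these are standard nvcc flags, and main.cpp is the trivial file from above:

nvcc -x cu --dryrun main.cpp     # print the internal sub-commands without running them
nvcc -x cu --verbose main.cpp    # print and run them, so each step can be timed separately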

Per @crtrott, this is most likely a filesystem issue. As evidence, I copied everything to /tmp on the compute node where I'm compiling. That reduced the nvcc_wrapper (really multiple nvcc/g++) time to 12.7s.

jhux2 avatar Aug 03 '22 00:08 jhux2
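A sketch of that workaround in script form, pointing both the wrapper's and the toolchain's temporary files at node-local storage (the scratch directory name is arbitrary; NVCC_WRAPPER_TMPDIR is the same variable used in the timings below, and TMPDIR is honored by gcc and nvcc for their intermediate files):

SCRATCH=/tmp/${USER}-scratch
mkdir -p "$SCRATCH"
export NVCC_WRAPPER_TMPDIR="$SCRATCH"   # redirect nvcc_wrapper temporaries off the network filesystem
export TMPDIR="$SCRATCH"                # redirect compiler/toolchain temporaries as well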

Time on kokkos-dev-2:

[crtrott@kokkos-dev-2 compiletime]$ time mpicxx main.cpp
real    0m0.803s
user    0m0.214s
sys     0m0.188s
[crtrott@kokkos-dev-2 compiletime]$ time nvcc -x cu main.cpp
real    0m1.148s
user    0m0.804s
sys     0m0.337s
[crtrott@kokkos-dev-2 compiletime]$ time ~/Kokkos//kokkos/bin/nvcc_wrapper main.cpp
real    0m1.164s
user    0m0.786s
sys     0m0.363s
[crtrott@kokkos-dev-2 compiletime]$ export OMPI_CXX=/home/crtrott/Kokkos//kokkos/bin/nvcc_wrapper
[crtrott@kokkos-dev-2 compiletime]$ time mpicxx main.cpp
real    0m1.232s
user    0m0.798s
sys     0m0.383s
[crtrott@kokkos-dev-2 compiletime]$

crtrott avatar Aug 03 '22 00:08 crtrott

On vortex on a compute node in /tmp/jhu-scratch:

(/tmp/jhu-scratch) time mpicxx main.cpp

real	0m5.910s
user	0m0.256s
sys	0m0.422s
(/tmp/jhu-scratch) time nvcc -x cu main.cpp

real	0m17.676s
user	0m1.301s
sys	0m1.120s
(/tmp/jhu-scratch) export NVCC_WRAPPER_TMPDIR=/tmp/jhu-scratch
(/tmp/jhu-scratch) time ./nvcc_wrapper main.cpp

real	0m18.044s
user	0m1.248s
sys	0m1.213s
(/tmp/jhu-scratch) export OMPI_CXX=/tmp/jhu-scratch/nvcc_wrapper
(/tmp/jhu-scratch) time mpicxx main.cpp

real	0m20.761s
user	0m1.298s
sys	0m1.492s

jhux2 avatar Aug 03 '22 00:08 jhux2

Vortex admins are looking into this. Results of the same tests on lassen:

(~/tmp) time mpicxx main.cpp

real	0m0.389s
user	0m0.243s
sys	0m0.093s
(~/tmp) time nvcc -x cu main.cpp

real	0m1.595s
user	0m1.282s
sys	0m0.244s
(~/tmp) time ./nvcc_wrapper main.cpp

real	0m1.603s
user	0m1.262s
sys	0m0.266s
(~/tmp) export OMPI_CXX=~/tmp/nvcc_wrapper
(~/tmp) time mpicxx main.cpp

real	0m0.345s
user	0m0.277s
sys	0m0.046s

jhux2 avatar Aug 03 '22 23:08 jhux2

Wow. That's... a serious difference.

csiefer2 avatar Aug 04 '22 16:08 csiefer2

Something is off with the TCE wrapper John was using. (Edit: even the real nvcc is still 3x slower than on kokkos-dev.)

# this is "nvcc" - but it's really just a wrapper by tce
[jjellio@vortex1 ~]$ time nvcc -x cu dummy.cpp 

real	0m22.238s
user	0m1.402s
sys	0m1.290s

# this is the real nvcc
[jjellio@vortex1 ~]$ time /usr/tce/packages/cuda/cuda-10.1.243/nvidia/bin/nvcc  -x cu dummy.cpp 

real	0m3.815s
user	0m1.289s
sys	0m0.740s

# try it again
[jjellio@vortex1 ~]$ time nvcc -x cu dummy.cpp 

real	0m20.189s
user	0m1.498s
sys	0m1.179s

# try it again ...
[jjellio@vortex1 ~]$ time /usr/tce/packages/cuda/cuda-10.1.243/nvidia/bin/nvcc  -x cu dummy.cpp 

real	0m3.740s
user	0m1.534s
sys	0m0.498s

jjellio avatar Aug 08 '22 16:08 jjellio
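A quick way to tell whether the nvcc on PATH is the real compiler or a site wrapper script, using only standard shell tools:

which nvcc                      # where the shadowing binary lives
file "$(which nvcc)"            # a wrapper typically shows up as a shell script
readlink -f "$(which nvcc)"     # resolve any symlink chain to the real target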

I believe this has been resolved, so I'll close. If this is still an issue, reopen, and let's dive into it again with Kevin B.

jjellio avatar Nov 22 '22 19:11 jjellio