toolchain: Enable NCCL for COSMA

oschuett opened this issue 2 years ago • 8 comments

The first half of #2202.

oschuett avatar Jul 23 '22 12:07 oschuett

The NCCL backend requires that a 1 rank : 1 GPU device mapping is used, even if CRAY_CUDA_MPS=1. Could that be the reason for the failing tests? Also, cudaSetDevice(rank) should be called on each rank.
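
For illustration, here is a minimal sketch (not COSMA's actual code) of the rank-to-device mapping described above, using only standard MPI and CUDA runtime calls; in a multi-node run it is the node-local rank that should select the device:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // GPUs are a per-node resource, so determine the rank within the node.
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        int local_rank = 0;
        MPI_Comm_rank(node_comm, &local_rank);

        int num_devices = 0;
        cudaGetDeviceCount(&num_devices);

        // Bind this rank to its own GPU before any NCCL communicator is created.
        // With the NCCL backend the mapping must be 1 rank : 1 device, i.e. the
        // number of ranks per node must not exceed num_devices.
        cudaSetDevice(local_rank % num_devices);

        // ... create NCCL communicators, run COSMA, etc. ...

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }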

kabicm avatar Jul 23 '22 16:07 kabicm

The NCCL backend requires that a 1 rank : 1 GPU device mapping is used,

Is this a restriction of NCCL itself or of COSMA?

The test uses two MPI ranks. It passes when run with two GPUs, i.e. each rank gets its own GPU. When run with a single GPU (CUDA_VISIBLE_DEVICES=0), it fails:

Found 1 GPUs
MPI rank 0 uses GPU #0
MPI rank 1 uses GPU #0
     1 P_Mix/Diag. 0.40E+00    4.1  2886.18660597      -262.4671943378 -2.62E+02
  Decoupling Energy:                                              39.2366465978
  Recoupling Energy:                                             -30.0279617197
  Adding QM/MM electrostatic potential to the Kohn-Sham potential.
[NCCL ERROR]: invalid usage
[NCCL ERROR]: invalid usage
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7f53adf85d21 in ???
#1  0x7f53adf84ef5 in ???
#2  0x7f53adb920bf in ???
#3  0x7f53adb9203b in ???
#4  0x7f53adb71858 in ???
#5  0x7f53d35b7910 in ???
#6  0x7f53d35c338b in ???
#7  0x7f53d35c33f6 in ???
#8  0x7f53d35c36a8 in ???
#9  0x5556a6780c5f in _ZN5cosma3gpu17check_nccl_statusE12ncclResult_t
	at /home/ole/git/cp2k/tools/toolchain/build/COSMA-v2.6.1/src/cosma/gpu/nccl_utils.cpp:17
#10  0x5556aa807f63 in _ZN5cosma3gpu16mpi_to_nccl_commEi
	at /home/ole/git/cp2k/tools/toolchain/build/COSMA-v2.6.1/src/cosma/gpu/nccl_utils.cpp:39
#11  0x5556aa80c94e in _ZN5cosma12communicator20create_communicatorsEi
	at /home/ole/git/cp2k/tools/toolchain/build/COSMA-v2.6.1/src/cosma/communicator.cpp:300
#12  0x5556aa80cc52 in _ZN5cosma12communicatorC2ENS_8StrategyEi
	at /home/ole/git/cp2k/tools/toolchain/build/COSMA-v2.6.1/src/cosma/communicator.cpp:65
#13  0x5556aa7dc987 in _ZSt11make_uniqueIN5cosma12communicatorEJRKNS0_8StrategyERiEENSt9_MakeUniqIT_E15__single_objectEDpOT0_
	at /usr/include/c++/9/bits/unique_ptr.h:857
#14  0x5556aa7dc987 in _ZN5cosma13cosma_contextIdE14register_stateEiNS_8StrategyE
	at /home/ole/git/cp2k/tools/toolchain/build/COSMA-v2.6.1/src/cosma/context.cpp:87

oschuett avatar Jul 23 '22 16:07 oschuett

Thanks, Ole, for checking. This seems to be a limitation of NCCL itself, even if MPS is enabled, as stated in https://github.com/NVIDIA/nccl/issues/418, more concretely in this comment:

This is not allowed and it will not work, even if you remove the check. The whole topology detection / graph search would not work with multiple ranks being on the same GPU.

Apparently, from NCCL v2.5 on, NVIDIA added an explicit error when multiple ranks per GPU are used, because in that case they cannot guarantee it will work (it might or might not).

Also, these tests fail at the point where the NCCL communicators are created, which is consistent with this limitation.
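
For reference, a minimal standalone sketch (not taken from CP2K or COSMA) that mirrors the failing setup: two MPI ranks pinned to the same GPU, followed by NCCL communicator creation. On NCCL >= 2.5 this is expected to fail at ncclCommInitRank with an error like the one in the backtrace above:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <nccl.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, nranks = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        // All ranks share GPU #0, as with CUDA_VISIBLE_DEVICES=0 above.
        cudaSetDevice(0);

        // Standard NCCL bootstrap: rank 0 creates the unique id, everyone
        // receives it via MPI and joins the communicator.
        ncclUniqueId id;
        if (rank == 0) ncclGetUniqueId(&id);
        MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

        ncclComm_t comm;
        const ncclResult_t res = ncclCommInitRank(&comm, nranks, id, rank);
        if (res != ncclSuccess) {
            std::printf("rank %d: ncclCommInitRank failed: %s\n",
                        rank, ncclGetErrorString(res));
        } else {
            ncclCommDestroy(comm);
        }

        MPI_Finalize();
        return 0;
    }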

kabicm avatar Jul 23 '22 16:07 kabicm

Thanks for the link, Marko!

This might be a real problem because our CPU code is still predominantly parallelized via MPI. Although we strive to bring OpenMP on par, there is still a long way to go. So, with this new constraint, users will have to choose between wasting CPU resources (by using too few MPI ranks) and wasting GPU resources (by not utilizing NCCL).

oschuett avatar Jul 23 '22 17:07 oschuett

I see the problem, although I still think that cp2k+cosma+nccl could outperform cp2k+cosma, at least on architectures with multiple GPUs per node, especially for GEMM-dominated simulations.

Another option is to use GPU-aware MPI by building COSMA with -DCOSMA_WITH_GPU_AWARE_MPI=ON. I am not sure whether GPU-aware MPI has the same limitation as NCCL. I think @alazzaro should know more about it.
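
For context, "GPU-aware MPI" means that device pointers can be passed directly to MPI calls, without staging the data through host memory. A minimal sketch, assuming the MPI library itself was built with CUDA support (independent of COSMA):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        const int n = 1 << 20;
        double* d_buf = nullptr;
        cudaMalloc(&d_buf, n * sizeof(double));   // buffer lives on the GPU
        cudaMemset(d_buf, 0, n * sizeof(double));

        // With a CUDA-aware MPI the device pointer is passed directly;
        // without it, an explicit cudaMemcpy to a host buffer would be needed.
        MPI_Allreduce(MPI_IN_PLACE, d_buf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }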

kabicm avatar Jul 23 '22 21:07 kabicm

Back from a short break... Personally, my advice is to avoid any hardware-related optimizations in the toolchain. The toolchain should be an entry point for getting CP2K up and running on a wide range of systems. Then we can put a large warning somewhere saying: "please personalize your options for hardware-specific optimizations", and there we can mention NCCL and GPU-aware MPI.

For the specific request:

  1. NCCL is only useful when you have a bad MPI implementation and multi-GPU nodes. Furthermore, there is the limitation on the number of ranks per GPU, so I would definitely advise against using it for a general CP2K installation, unless COSMA is the main bottleneck (RPA tests? multiple installations of CP2K?).
  2. GPU-aware MPI requires that feature to be supported and enabled in the MPI library (a runtime check is sketched below). As far as I can see, MPICH and OpenMPI enable it by default when CUDA is found, so it should be fine. However, I can see good performance only when there is good support...
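
As a side note, whether the installed MPI is actually CUDA-aware can be probed at compile time and at run time. A sketch of such a check, using the Open MPI-specific extensions (other MPI implementations need different mechanisms):

    #include <cstdio>
    #include <mpi.h>
    #if defined(OPEN_MPI) && OPEN_MPI
    #include <mpi-ext.h>   // Open MPI extensions: MPIX_CUDA_AWARE_SUPPORT, MPIX_Query_cuda_support()
    #endif

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
    #if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
        std::printf("Compile-time CUDA-aware support: yes\n");
        std::printf("Run-time CUDA-aware support:     %s\n",
                    MPIX_Query_cuda_support() ? "yes" : "no");
    #else
        std::printf("This MPI does not advertise CUDA-aware support at compile time.\n");
    #endif
        MPI_Finalize();
        return 0;
    }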

alazzaro avatar Jul 26 '22 08:07 alazzaro

My comment on RPA: it depends on the system size. If we consider systems of moderate size (100-200 atoms, relevant for simulations), COSMA contributes ca. 15% of the computation time (with and without gradients). Only for larger systems (roughly 500 atoms) will COSMA dominate. But then one could even think of using the low-scaling code, which does not make use of COSMA at all.

fstein93 avatar Jul 26 '22 09:07 fstein93

My opinion on these is a bit different:

  • RPA: I think what matters is how much time pdgemm takes before communication is optimized, i.e. without COSMA. With COSMA we expect that, in the ideal case, the pdgemm time will be reduced and potentially no longer be dominant. For example, the last time we ran RPA (128 water molecules on 128 GPU nodes on Daint), pdgemm was taking 80% of the total runtime with MKL (and even 90% with cray-libsci), whereas with COSMA it was taking 32% of the total runtime. The full plot is in the COSMA+COSTA paper.

  • gpu-aware MPI: even if GPU-aware MPI is enabled automatically when CUDA is present, COSMA would still have to be compiled with the -DCOSMA_WITH_GPU_AWARE_MPI=ON CMake option to make use of it. Most users will surely not do that unless it is possible through the toolchain.

kabicm avatar Jul 26 '22 13:07 kabicm

I'm going to close this PR while we're waiting for https://github.com/eth-cscs/COSMA/issues/120.

oschuett avatar Sep 06 '22 12:09 oschuett