MPI_Pack with device memory
Hi,
I have built this on a system with a single GPU, which I would like to share between two MPI ranks (just for the sake of getting things up and running).
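For what it's worth, the rank-to-GPU mapping I have in mind is nothing fancy; a minimal sketch (my own, not Comb's actual device-selection code) would be:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int num_devices = 0;
  cudaGetDeviceCount(&num_devices);

  // With only one GPU present, both ranks land on device 0 and simply share it.
  if (num_devices > 0) cudaSetDevice(rank % num_devices);

  // ... halo exchange / benchmark work would go here ...

  MPI_Finalize();
  return 0;
}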
The build basically follows ubuntu_nvcc10_gcc8, adjusted for gcc 10.
I built commit e06e54d351f7b31177db89f37b4326c8e96656bd (the latest at the time of writing).
-- The CXX compiler identification is GNU 10.2.0
-- Check for working CXX compiler: /usr/bin/g++-10
-- Check for working CXX compiler: /usr/bin/g++-10 - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- BLT Version: 0.3.0
-- CMake Version: 3.17.5
-- CMake Executable: /home/pearson/software/cmake-3.17.5/bin/cmake
-- Found Git: /usr/bin/git (found version "2.28.0")
-- Git Support is ON
-- Git Executable: /usr/bin/git
-- Git Version: 2.28.0
-- MPI Support is ON
-- Enable FindMPI: ON
-- Found MPI_CXX: /home/pearson/software/openmpi-4.0.5/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- BLT MPI Compile Flags: $<$<NOT:$<COMPILE_LANGUAGE:CUDA>>:-pthread>;$<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=-pthread>
-- BLT MPI Include Paths: /home/pearson/software/openmpi-4.0.5/include
-- BLT MPI Libraries: /home/pearson/software/openmpi-4.0.5/lib/libmpi.so
-- BLT MPI Link Flags: -Wl,-rpath -Wl,/home/pearson/software/openmpi-4.0.5/lib -Wl,--enable-new-dtags -pthread
-- MPI Executable: /home/pearson/software/openmpi-4.0.5/bin/mpiexec
-- MPI Num Proc Flag: -n
-- MPI Command Append:
-- OpenMP Support is OFF
-- CUDA Support is ON
-- The CUDA compiler identification is NVIDIA 11.1.74
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found version "11.1")
-- CUDA Version: 11.1
-- CUDA Compiler: /usr/local/cuda/bin/nvcc
-- CUDA Host Compiler: /usr/bin/g++-10
-- CUDA Include Path: /usr/local/cuda/include
-- CUDA Libraries: /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib/x86_64-linux-gnu/librt.so
-- CUDA Compile Flags:
-- CUDA Link Flags:
-- CUDA Separable Compilation: ON
-- CUDA Link with NVCC:
-- HIP Support is OFF
-- HCC Support is OFF
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
-- Sphinx support is ON
-- Failed to locate Sphinx executable (missing: SPHINX_EXECUTABLE)
-- Valgrind support is ON
-- Failed to locate Valgrind executable (missing: VALGRIND_EXECUTABLE)
-- Uncrustify support is ON
-- Failed to locate Uncrustify executable (missing: UNCRUSTIFY_EXECUTABLE)
-- AStyle support is ON
-- Failed to locate AStyle executable (missing: ASTYLE_EXECUTABLE)
-- Cppcheck support is ON
-- Failed to locate Cppcheck executable (missing: CPPCHECK_EXECUTABLE)
-- ClangQuery support is ON
-- Failed to locate ClangQuery executable (missing: CLANGQUERY_EXECUTABLE)
-- C Compiler family is GNU
-- Adding optional BLT definitions and compiler flags
-- Enabling all compiler warnings on all targets.
-- Fortran support disabled.
-- CMAKE_C_FLAGS flags are: -Wall -Wextra
-- CMAKE_CXX_FLAGS flags are: -Wall -Wextra
-- CMAKE_EXE_LINKER_FLAGS flags are:
-- Google Test Support is ON
-- Google Mock Support is OFF
-- The C compiler identification is GNU 10.2.0
-- Check for working C compiler: /usr/bin/gcc-10
-- Check for working C compiler: /usr/bin/gcc-10 - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Found PythonInterp: /usr/bin/python3.8 (found version "3.8.6")
-- MPI Enabled
-- Cuda Enabled
-- Configuring done
-- Generating done
-- Build files have been written to: /home/pearson/repos/Comb/build_debian-nvcc11-gcc10
I tried to run it with the following:
~/software/openmpi-4.0.5/bin/mpirun -n 2 bin/comb 10_10_10 -divide 2_1_1 -cuda_aware_mpi -comm enable mpi -exec enable mpi_type -memory enable cuda_device
but I get the following error:
Comb version 0.2.0
Args bin/comb;10_10_10;-divide;2_1_1;-cuda_aware_mpi;-comm;enable;mpi;-exec;enable;mpi_type;-memory;enable;cuda_device
Started rank 0 of 2
Node deneb
Compiler "/usr/bin/g++-10"
Cuda compiler "/usr/local/cuda/bin/nvcc"
Cuda driver version 11010
Cuda runtime version 11010
GPU 0 visible undefined
Cart coords 0 0 0
Message policy cutoff 200
Post Recv using wait_all method
Post Send using wait_all method
Wait Recv using wait_all method
Wait Send using wait_all method
Num cycles 5
Num vars 1
ghost_widths 1 1 1
sizes 10 10 10
divisions 2 1 1
periodic 0 0 0
division map
map 0 0 0
map 5 10 10
map 10
Starting test memcpy seq dst Host src Host
Starting test Comm mock Mesh seq Host Buffers seq Host seq Host
Starting test Comm mock Mesh seq Host Buffers mpi_type Host mpi_type Host
comb: /home/pearson/repos/Comb/include/comm_pol_mock.hpp:948: void detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::Irecv(detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::context_type&, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::communicator_type&, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::message_type**, IdxT, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::request_type*): Assertion `buf != nullptr' failed.
comb: /home/pearson/repos/Comb/include/comm_pol_mock.hpp:948: void detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::Irecv(detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::context_type&, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::communicator_type&, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::message_type**, IdxT, detail::MessageGroup<detail::MessageBase::Kind::recv, mock_pol, mpi_type_pol>::request_type*): Assertion `buf != nullptr' failed.
[deneb:23991] *** Process received signal ***
[deneb:23991] Signal: Aborted (6)
[deneb:23991] Signal code: (-6)
[deneb:23992] *** Process received signal ***
[deneb:23992] Signal: Aborted (6)
[deneb:23992] Signal code: (-6)
[deneb:23991] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7ffb77f91140]
[deneb:23992] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7fd0a2014140]
[deneb:23992] [ 1] [deneb:23991] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x141)[0x7ffb77ac6db1]
[deneb:23991] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x141)[0x7fd0a1b49db1]
[deneb:23992] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x123)[0x7ffb77ab0537]
[deneb:23991] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0x123)[0x7fd0a1b33537]
[deneb:23992] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2540f)[0x7fd0a1b3340f]
[deneb:23992] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2540f)[0x7ffb77ab040f]
[deneb:23991] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x345b2)[0x7fd0a1b425b2]
[deneb:23992] [ 5] bin/comb(+0x32b1d)[0x55baa8e21b1d]
[deneb:23992] [ 6] bin/comb(+0x43508)[0x55baa8e32508]
/lib/x86_64-linux-gnu/libc.so.6(+0x345b2)[0x7ffb77abf5b2]
[deneb:23991] [ 5] bin/comb(+0x32b1d)[0x56455c96ab1d]
[deneb:23991] [ 6] bin/comb(+0x43508)[0x56455c97b508]
[deneb:23991] [ 7] bin/comb(+0x468c6)[0x56455c97e8c6]
[deneb:23991] [ 8] bin/comb(+0x5cdc5)[0x56455c994dc5]
[deneb:23991] [ 9] bin/comb(+0x63303)[0x56455c99b303]
[deneb:23991] [10] bin/comb(+0x635a7)[0x56455c99b5a7]
[deneb:23991] [11] bin/comb(+0xe184)[0x56455c946184]
[deneb:23991] [12] [deneb:23992] [ 7] bin/comb(+0x468c6)[0x55baa8e358c6]
[deneb:23992] [ 8] bin/comb(+0x5cdc5)[0x55baa8e4bdc5]
[deneb:23992] [ 9] bin/comb(+0x63303)[0x55baa8e52303]
[deneb:23992] [10] bin/comb(+0x635a7)[0x55baa8e525a7]
[deneb:23992] [11] bin/comb(+0xe184)[0x55baa8dfd184]
[deneb:23992] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7fd0a1b34cca]
[deneb:23992] [13] bin/comb(+0xf41a)[0x55baa8dfe41a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7ffb77ab1cca]
[deneb:23991] [13] bin/comb(+0xf41a)[0x56455c94741a]
[deneb:23991] *** End of error message ***
[deneb:23992] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node deneb exited on signal 6 (Aborted).
--------------------------------------------------------------------------
I also managed to run the focused tests:
cd <build>/bin
../../scripts/run_tests.bash 1 ../../scripts/focused_tests.bash
which appears to have worked with the following output:
mpirun -np 1 /home/pearson/repos/Comb/build_debian-nvcc11-gcc10/bin/comb -comm post_recv wait_any -comm post_send wait_any -comm wait_recv wait_any -comm wait_send wait_any 100_100_100 -divide 1_1_1 -periodic 1_1_1 -ghost 1_1_1 -vars 3 -cycles 25 -comm cutoff 250 -omp_threads 10 -exec disable seq -exec enable cuda -memory disable host -memory enable cuda_managed -comm enable mock -comm enable mpi
Comb version 0.2.0
Args /home/pearson/repos/Comb/build_debian-nvcc11-gcc10/bin/comb;-comm;post_recv;wait_any;-comm;post_send;wait_any;-comm;wait_recv;wait_any;-comm;wait_send;wait_any;100_100_100;-divide;1_1_1;-periodic;1_1_1;-ghost;1_1_1;-vars;3;-cycles;25;-comm;cutoff;250;-omp_threads;10;-exec;disable;seq;-exec;enable;cuda;-memory;disable;host;-memory;enable;cuda_managed;-comm;enable;mock;-comm;enable;mpi
Started rank 0 of 1
Node deneb
Compiler "/usr/bin/g++-10"
Cuda compiler "/usr/local/cuda/bin/nvcc"
Cuda driver version 11010
Cuda runtime version 11010
GPU 0 visible undefined
Not built with openmp, ignoring -omp_threads 10.
Cart coords 0 0 0
Message policy cutoff 250
Post Recv using wait_any method
Post Send using wait_any method
Wait Recv using wait_any method
Wait Send using wait_any method
Num cycles 25
Num vars 3
ghost_widths 1 1 1
sizes 100 100 100
divisions 1 1 1
periodic 1 1 1
division map
map 0 0 0
map 100 100 100
Starting test memcpy cuda dst Managed src HostPinned
Starting test memcpy cuda dst Managed src Device
Starting test Comm mock Mesh cuda Managed Buffers cuda HostPinned cuda HostPinned
Starting test Comm mpi Mesh cuda Managed Buffers cuda HostPinned cuda HostPinned
done
real 0m1.475s
user 0m1.244s
sys 0m0.146s
Is device memory + MPI + MPI_Type a supported configuration at this time? If so, any advice?
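For context, the communication pattern I am hoping to exercise is roughly the following (my own minimal sketch, assuming a CUDA-aware MPI; it is not Comb's internal message code):

#include <mpi.h>
#include <cuda_runtime.h>

// Run with exactly 2 ranks, e.g. mpirun -n 2 ./sketch
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // A strided "face" of a 10x10x10 mesh described by a derived datatype.
  MPI_Datatype face;
  MPI_Type_vector(/*count=*/10, /*blocklength=*/10, /*stride=*/100, MPI_DOUBLE, &face);
  MPI_Type_commit(&face);

  // The mesh data itself lives in device memory.
  const size_t bytes = 10 * 10 * 10 * sizeof(double);
  double* d_mesh = nullptr;
  cudaMalloc((void**)&d_mesh, bytes);
  cudaMemset(d_mesh, 0, bytes);

  // With a CUDA-aware MPI, the device pointer is handed to MPI directly and
  // the library is responsible for packing/staging the non-contiguous type.
  MPI_Request req;
  if (rank == 0) {
    MPI_Isend(d_mesh, 1, face, 1, 0, MPI_COMM_WORLD, &req);
  } else {
    MPI_Irecv(d_mesh, 1, face, 0, 0, MPI_COMM_WORLD, &req);
  }
  MPI_Wait(&req, MPI_STATUS_IGNORE);

  cudaFree(d_mesh);
  MPI_Type_free(&face);
  MPI_Finalize();
  return 0;
}

That is what I mean by letting the MPI implementation handle the datatypes for GPU-resident data.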
Thanks!
That shouldn't be breaking like that. I'll see if I can reproduce this crash.
I was able to reproduce this issue. It should now be fixed on develop.
Hi, thank you. In commit 3c4a1be7b2ff793fabf2b299dfa673e2fae8e86f, I no longer see the crash.
However, what I would like to do is see how the MPI implementation handles datatypes on GPU memory, so I am now running:
~/software/openmpi-4.0.5/bin/mpirun -n 2 bin/comb 256_256_256 -divide 2_1_1 -comm disable mock -comm enable mpi -exec enable mpi_type -memory disable host -memory enable cuda_device
It seems that no benchmarks are actually run, as the output is this:
Comb version 0.2.0
Args bin/comb;256_256_256;-divide;2_1_1;-comm;disable;mock;-comm;enable;mpi;-exec;enable;mpi_type;-memory;disable;host;-memory;enable;cuda_device
Started rank 0 of 2
Node deneb
Compiler "/usr/bin/g++-10"
Cuda compiler "/usr/local/cuda/bin/nvcc"
Cuda driver version 11010
Cuda runtime version 11010
GPU 0 visible undefined
Cart coords 0 0 0
Message policy cutoff 200
Post Recv using wait_all method
Post Send using wait_all method
Wait Recv using wait_all method
Wait Send using wait_all method
Num cycles 5
Num vars 1
ghost_widths 1 1 1
sizes 256 256 256
divisions 2 1 1
periodic 0 0 0
division map
map 0 0 0
map 128 256 256
map 256
Is this configuration supported?
It looks like -cuda_aware_mpi got dropped from the command line.
I dropped it because I interpreted its description to mean that it only enabled some assertions and extra checks, but now I see that the benchmarks themselves are referred to as "tests" in the output.
-cuda_aware_mpi Assert that you are using a cuda aware mpi implementation and enable tests that pass cuda device or managed memory to MPI
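As an aside, one way to sanity-check that an Open MPI build really is CUDA-aware (an assumption I am making here; this uses Open MPI's mpi-ext extension and is not portable to other MPIs) is roughly:

#include <mpi.h>
#include <mpi-ext.h>   // Open MPI-specific extensions
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
  std::printf("compile-time CUDA-aware support: yes\n");
#else
  std::printf("compile-time CUDA-aware support: no/unknown\n");
#endif
#if defined(MPIX_CUDA_AWARE_SUPPORT)
  // Runtime check: returns 1 if the library can accept device pointers.
  std::printf("runtime CUDA-aware support: %d\n", MPIX_Query_cuda_support());
#endif
  MPI_Finalize();
  return 0;
}

(ompi_info --parsable --all | grep mpi_built_with_cuda_support:value reports the same thing without writing any code.)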
In any case, I tried again with the flag enabled:
$ ~/software/openmpi-4.0.5/bin/mpirun -n 2 bin/comb 256_256_256 -divide 2_1_1 -comm disable mock -comm enable mpi -exec enable mpi_type -memory disable host -memory enable cuda_device -cuda_aware_mpi
Comb version 0.2.0
Args bin/comb;256_256_256;-divide;2_1_1;-comm;disable;mock;-comm;enable;mpi;-exec;enable;mpi_type;-memory;disable;host;-memory;enable;cuda_device;-cuda_aware_mpi
Started rank 0 of 2
Node deneb
Compiler "/usr/bin/g++-10"
Cuda compiler "/usr/local/cuda/bin/nvcc"
Cuda driver version 11010
Cuda runtime version 11010
GPU 0 visible undefined
Cart coords 0 0 0
Message policy cutoff 200
Post Recv using wait_all method
Post Send using wait_all method
Wait Recv using wait_all method
Wait Send using wait_all method
Num cycles 5
Num vars 1
ghost_widths 1 1 1
sizes 256 256 256
divisions 2 1 1
periodic 0 0 0
division map
map 0 0 0
map 128 256 256
map 256