
Dynamic MPI library built with OpenACC flags results in a crash at the end of the simulation

pramodk opened this issue 2 years ago • 0 comments

Describe the issue

When dynamic MPI support is enabled, we build the libcorenrnmpi_.so library. If this library is built with OpenACC flags (e.g. -acc), then the program crashes in the exit handler.
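For context, a dynamic-MPI build does not link the MPI layer into the executable: libcorenrnmpi_.so is resolved at runtime with dlopen(), so any OpenACC runtime state created for it belongs to a dynamically loaded object. A rough sketch of that loading pattern, assuming illustrative names only (the symbol "corenrn_mpi_init" is hypothetical, not CoreNeuron's actual API):

#include <dlfcn.h>

int main() {
    // the MPI layer is picked at runtime instead of at link time
    void* h = dlopen("libcorenrnmpi_.so", RTLD_NOW | RTLD_GLOBAL);
    if (!h) return 1;
    // look up an entry point; "corenrn_mpi_init" is an illustrative name
    auto mpi_init = (void (*)(int*, char***)) dlsym(h, "corenrn_mpi_init");
    (void) mpi_init;
    return 0;
}

The crash can be reproduced on BB5 as follows: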

salloc --account=proj16 --partition=prod_p2 --time=08:00:00 --nodes=1 --constraint=v100 --gres=gpu:4 -n 40 --mem 0 --exclusive
module purge
module load unstable nvhpc/21.2  hpe-mpi cuda cmake
git clone --depth 1 [email protected]:neuronsimulator/nrn.git
git clone --depth 1 [email protected]:BlueBrain/CoreNeuron.git
cd CoreNeuron && mkdir BUILD && cd BUILD
cmake -DCORENRN_ENABLE_DYNAMIC_MPI=ON -DCMAKE_CXX_FLAGS="-acc" -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvcc ..
make -j
./bin/nrnivmodl-core ../../nrn/test/coreneuron/mod/

srun -n 1 ./x86_64/special-core --mpi -d ../coreneuron/tests/integration/ring
.....
....
Solver Time : 0.0748029


 Simulation Statistics
 Number of cells: 5
 Number of compartments: 115
 Number of presyns: 28
 Number of input presyns: 0
 Number of synapses: 15
 Number of point processes: 38
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 9
 Number of spikes with non negative gid-s: 9
CoreNEURON run
.....
...
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
	Process ID: 33265, Host: ldir01u09.bbp.epfl.ch, Program: /gpfs/bbp.cscs.ch/home/kumbhar/tmp/x86_64/special.nrn
	MPT Version: HPE HMPT 2.22  03/31/20 16:17:35

MPT: --------stack traceback-------
MPT: Attaching to program: /proc/33265/exe, process 33265
MPT: [New LWP 33310]
MPT: [New LWP 33309]
MPT: [New LWP 33283]
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (no debugging symbols found)...done.
....
MPT: done.
MPT: 0x00002aaaad9961d9 in waitpid () from /lib64/libpthread.so.0
MPT: Missing separate debuginfos, use: debuginfo-install bbp-nvidia-driver-470.57.02-2.x86_64 glibc-2.17-324.el7_9.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libibverbs-54mlnx1-1.54103.x86_64 libnl3-3.2.28-4.el7.x86_64 libselinux-2.5-15.el7.x86_64 nss-softokn-freebl-3.53.1-6.el7_9.x86_64 openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64
MPT: (gdb) #0  0x00002aaaad9961d9 in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaab216a3e6 in mpi_sgi_system (
MPT: #2  MPI_SGI_stacktraceback (
MPT:     header=header@entry=0x7fffffff67d0 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 33265, Host: ldir01u09.bbp.epfl.ch, Program: /gpfs/bbp.cscs.ch/home/kumbhar/tmp/x86_64/special.nrn\n\tMPT Version: HPE HMPT 2.22  03/31/2"...) at sig.c:340
MPT: #3  0x00002aaab216a5d8 in first_arriver_handler (signo=signo@entry=11,
MPT:     stack_trace_sem=stack_trace_sem@entry=0x2aaab33e0080) at sig.c:489
MPT: #4  0x00002aaab216a8b3 in slave_sig_handler (signo=11,
MPT:     siginfo=<optimized out>, extra=<optimized out>) at sig.c:565
MPT: #5  <signal handler called>
MPT: #6  0x00002aaaabcc2cd2 in ?? ()
MPT:    from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #7  0x00002aaaabcc6614 in ?? ()
MPT:    from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #8  0x00002aaaabcb61bc in ?? ()
MPT:    from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #9  0x00002aaaabcb7cdb in ?? ()
MPT:    from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #10 0x00002aaaab984da7 in __pgi_uacc_cuda_unregister_fat_binary (
MPT:     pgi_cuda_loc=0x2aaaaacb5a40 <__PGI_CUDA_LOC>) at ../../src/cuda_init.c:649
MPT: #11 0x00002aaaab984d46 in __pgi_uacc_cuda_unregister_fat_binaries ()
MPT:     at ../../src/cuda_init.c:635
MPT: #12 0x00002aaaae553ce9 in __run_exit_handlers () from /lib64/libc.so.6
MPT: #13 0x00002aaaae553d37 in exit () from /lib64/libc.so.6
MPT: #14 0x00002aaaab15b264 in hoc_quit () at /root/nrn/src/oc/hoc.cpp:1177
MPT: #15 0x00002aaaab1425f4 in hoc_call () at /root/nrn/src/oc/code.cpp:1389
MPT: #16 0x00002aaab3f7747e in _INTERNAL_37__root_nrn_src_nrnpython_nrnpy_hoc_cpp_629d835d::fcall () at /root/nrn/src/nrnpython/nrnpy_hoc.cpp:692
MPT: #17 0x00002aaaab0ddf35 in OcJump::fpycall ()
MPT:     at /root/nrn/src/nrniv/../ivoc/ocjump.cpp:222
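Reading the trace bottom-up: hoc_quit() calls exit(), libc's __run_exit_handlers() then runs the OpenACC runtime's __pgi_uacc_cuda_unregister_fat_binaries(), and the SIGSEGV happens inside libcudart during that unregistration. So the crash appears to occur when an OpenACC exit handler, registered on behalf of the dynamically loaded MPI library, touches CUDA runtime state that is no longer valid at exit time; this is an interpretation of the trace, not a confirmed root cause.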

To Reproduce

See the instructions above

Expected behavior

With or without the -acc flag, the shared library should work fine.

System (please complete the following information)

  • System/OS: BB5
  • Compiler: NVHPC 21.2
  • Version: master, with the -acc flag added to the MPI library build as well
  • Backend: GPU

Additional context

We should provide a small reproducer on the NVIDIA developer forum.
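A minimal sketch of what such a reproducer could look like, independent of NEURON/CoreNeuron: an OpenACC shared library that is dlopen()ed by a plain host executable, which then exits. The file names, the exported function, and the build lines below are illustrative assumptions; only the nvc++ compiler and the -acc flag mirror the setup above.

// repro_lib.cpp -- shared library built with OpenACC
// build (assumed): nvc++ -acc -fPIC -shared repro_lib.cpp -o librepro.so
extern "C" void run_on_device() {
    int data[64];
    // trivial kernel so the library registers its fat binary with the
    // CUDA runtime when it is loaded
    #pragma acc parallel loop copyout(data[0:64])
    for (int i = 0; i < 64; ++i)
        data[i] = i;
}

// repro_main.cpp -- plain host executable, built without -acc
// build (assumed): nvc++ repro_main.cpp -o repro -ldl
#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>

int main() {
    void* handle = dlopen("./librepro.so", RTLD_NOW);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    auto fn = reinterpret_cast<void (*)()>(dlsym(handle, "run_on_device"));
    if (fn)
        fn();
    // exit() runs the registered exit handlers; if the interpretation of the
    // trace above is right, the OpenACC handler from librepro.so crashes here
    std::exit(0);
}

If ./repro segfaults at exit the same way special-core does, that would be a self-contained test case to post on the forum.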

pramodk · Oct 20 '21 21:10