CoreNeuron
Dynamic MPI library built with OpenACC flags results in a crash at the end of the simulation
Describe the issue
When dynamic MPI support is enabled, we build a separate libcorenrnmpi_*.so library. If that library is also compiled with the OpenACC flag (-acc), the program crashes in the exit handler:
salloc --account=proj16 --partition=prod_p2 --time=08:00:00 --nodes=1 --constraint=v100 --gres=gpu:4 -n 40 --mem 0 --exclusive
module purge
module load unstable nvhpc/21.2 hpe-mpi cuda cmake
git clone --depth 1 [email protected]:neuronsimulator/nrn.git
git clone --depth 1 [email protected]:BlueBrain/CoreNeuron.git
cd CoreNeuron && mkdir BUILD && cd BUILD
cmake -DCORENRN_ENABLE_DYNAMIC_MPI=ON -DCMAKE_CXX_FLAGS="-acc" -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvcc ..
./bin/nrnivmodl-core ../../nrn/test/coreneuron/mod/
srun -n 1 ./x86_64/special-core --mpi -d ../coreneuron/tests/integration/ring
.....
....
Solver Time : 0.0748029
Simulation Statistics
Number of cells: 5
Number of compartments: 115
Number of presyns: 28
Number of input presyns: 0
Number of synapses: 15
Number of point processes: 38
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 9
Number of spikes with non negative gid-s: 9
CoreNEURON run
.....
...
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
Process ID: 33265, Host: ldir01u09.bbp.epfl.ch, Program: /gpfs/bbp.cscs.ch/home/kumbhar/tmp/x86_64/special.nrn
MPT Version: HPE HMPT 2.22 03/31/20 16:17:35
MPT: --------stack traceback-------
MPT: Attaching to program: /proc/33265/exe, process 33265
MPT: [New LWP 33310]
MPT: [New LWP 33309]
MPT: [New LWP 33283]
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (no debugging symbols found)...done.
....
MPT: done.
MPT: 0x00002aaaad9961d9 in waitpid () from /lib64/libpthread.so.0
MPT: Missing separate debuginfos, use: debuginfo-install bbp-nvidia-driver-470.57.02-2.x86_64 glibc-2.17-324.el7_9.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libibverbs-54mlnx1-1.54103.x86_64 libnl3-3.2.28-4.el7.x86_64 libselinux-2.5-15.el7.x86_64 nss-softokn-freebl-3.53.1-6.el7_9.x86_64 openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64
MPT: (gdb) #0 0x00002aaaad9961d9 in waitpid () from /lib64/libpthread.so.0
MPT: #1 0x00002aaab216a3e6 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7fffffff67d0 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 33265, Host: ldir01u09.bbp.epfl.ch, Program: /gpfs/bbp.cscs.ch/home/kumbhar/tmp/x86_64/special.nrn\n\tMPT Version: HPE HMPT 2.22 03/31/2"...) at sig.c:340
MPT: #3 0x00002aaab216a5d8 in first_arriver_handler (signo=signo@entry=11,
MPT: stack_trace_sem=stack_trace_sem@entry=0x2aaab33e0080) at sig.c:489
MPT: #4 0x00002aaab216a8b3 in slave_sig_handler (signo=11,
MPT: siginfo=<optimized out>, extra=<optimized out>) at sig.c:565
MPT: #5 <signal handler called>
MPT: #6 0x00002aaaabcc2cd2 in ?? ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #7 0x00002aaaabcc6614 in ?? ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #8 0x00002aaaabcb61bc in ?? ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #9 0x00002aaaabcb7cdb in ?? ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #10 0x00002aaaab984da7 in __pgi_uacc_cuda_unregister_fat_binary (
MPT: pgi_cuda_loc=0x2aaaaacb5a40 <__PGI_CUDA_LOC>) at ../../src/cuda_init.c:649
MPT: #11 0x00002aaaab984d46 in __pgi_uacc_cuda_unregister_fat_binaries ()
MPT: at ../../src/cuda_init.c:635
MPT: #12 0x00002aaaae553ce9 in __run_exit_handlers () from /lib64/libc.so.6
MPT: #13 0x00002aaaae553d37 in exit () from /lib64/libc.so.6
MPT: #14 0x00002aaaab15b264 in hoc_quit () at /root/nrn/src/oc/hoc.cpp:1177
MPT: #15 0x00002aaaab1425f4 in hoc_call () at /root/nrn/src/oc/code.cpp:1389
MPT: #16 0x00002aaab3f7747e in _INTERNAL_37__root_nrn_src_nrnpython_nrnpy_hoc_cpp_629d835d::fcall () at /root/nrn/src/nrnpython/nrnpy_hoc.cpp:692
MPT: #17 0x00002aaaab0ddf35 in OcJump::fpycall ()
MPT: at /root/nrn/src/nrniv/../ivoc/ocjump.cpp:222
To Reproduce
See the instructions above
Expected behavior
With or without the -acc flag, the shared library should work fine.
System (please complete the following information)
- System/OS: BB5
- Compiler: NVHPC 21.2
- Version: master, with the -acc flag additionally passed to the MPI library build
- Backend: GPU
Additional context
We should prepare a small reproducer and report this on the NVIDIA developer forum.