Segmentation faults with Structural Plasticity in NEST v3.6 onwards

Open: neuroady opened this issue 1 year ago • 1 comment

This issue has been opened in reference to a mailing list post. Details about the original post can be found here.

Using structural plasticity (SP) with MPI-based simulations leads to spontaneous crashes in NEST v3.6 and later.

To Reproduce

Steps to reproduce the behavior:

  1. Create an MPI-based script that demonstrates structural plasticity.
  2. Run the script with 32 or more MPI processes.
    • Runs with fewer MPI processes can also produce a segmentation fault.
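
For readers without access to the original script, the following is a minimal sketch of what step 1 might look like in PyNEST. It is not the reporter's minimal.py, and whether it triggers the crash is untested. The helper names (`growth_curve`, `sp_synapse_spec`, `run`), the element names `Axon_ex` and `Den_ex`, and all numeric values are assumptions chosen for illustration.

```python
"""Hypothetical structural-plasticity sketch (NOT the reporter's minimal.py)."""


def growth_curve(growth_rate=0.0001, eps=0.05):
    """Gaussian growth-curve parameters for one type of synaptic element."""
    return {
        "growth_curve": "gaussian",
        "growth_rate": growth_rate,  # assumed value
        "continuous": False,
        "eta": 0.0,  # minimum calcium concentration needed for growth
        "eps": eps,  # target calcium concentration (assumed value)
    }


def sp_synapse_spec(model="synapse_ex"):
    """Map a synapse model to the pre/post synaptic elements it connects."""
    return {
        model: {
            "synapse_model": model,
            "pre_synaptic_element": "Axon_ex",
            "post_synaptic_element": "Den_ex",
        }
    }


def run(n_neurons=1000, sim_time_ms=10000.0):
    """Build a small plastic network and simulate it (requires NEST 3.x)."""
    import nest  # imported here so the helpers above work without NEST

    nest.ResetKernel()
    nest.structural_plasticity_update_interval = 1000  # assumed value

    # Synapse model that structural plasticity creates and deletes.
    nest.CopyModel("static_synapse", "synapse_ex")
    nest.SetDefaults("synapse_ex", {"weight": 1.0, "delay": 1.0})
    nest.structural_plasticity_synapses = sp_synapse_spec("synapse_ex")

    # Each neuron grows axonal and dendritic elements along the same curve.
    elements = {"Axon_ex": growth_curve(), "Den_ex": growth_curve()}
    neurons = nest.Create(
        "iaf_psc_alpha", n_neurons, {"synaptic_elements": elements}
    )

    # Background drive so that neurons spike and calcium levels change.
    noise = nest.Create("poisson_generator", params={"rate": 10000.0})
    nest.Connect(noise, neurons, "all_to_all", {"weight": 6.2, "delay": 1.0})

    nest.EnableStructuralPlasticity()
    nest.Simulate(sim_time_ms)
```

To try step 2 with this sketch, one would append a call to `run()` and launch the file under the cluster's MPI launcher, for example `srun -n 32 python sp_sketch.py` (the file name is hypothetical).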

Expected behavior

The simulation crashes, and the error output indicates that a segmentation fault has occurred.

  • The stderr dump from minimal.py on NEST v3.6 with 32 MPI processes is shown below:

[jsfc114:24182:0:24182] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7473)
==== backtrace (tid:  24182) ====
0 0x000000000003e6f0 __GI___sigaction()  :0
1 0x0000000000655387 nest::Connector<nest::static_synapse<nest::TargetIdentifierPtrRport> >::send()  ???:0
2 0x000000000043cfd6 nest::EventDeliveryManager::deliver_events_<nest::SpikeData>()  event_delivery_manager.cpp:0
3 0x000000000043f29f nest::EventDeliveryManager::deliver_events()  ???:0
4 0x000000000040abaa nest::SimulationManager::update_()  simulation_manager.cpp:0
5 0x00000000000156e6 GOMP_parallel()  /dev/shm/swmanage/jusuf/GCCcore/12.3.0/system-system/gcc-12.3.0/stage3_obj/x86_64-pc-linux-gnu/libgomp/../../../libgomp/parallel.c:178
6 0x00000000000156e6 GOMP_parallel_end()  /dev/shm/swmanage/jusuf/GCCcore/12.3.0/system-system/gcc-12.3.0/stage3_obj/x86_64-pc-linux-gnu/libgomp/../../../libgomp/parallel.c:140
7 0x00000000000156e6 GOMP_parallel()  /dev/shm/swmanage/jusuf/GCCcore/12.3.0/system-system/gcc-12.3.0/stage3_obj/x86_64-pc-linux-gnu/libgomp/../../../libgomp/parallel.c:179
8 0x000000000040c067 nest::SimulationManager::update_()  ???:0
9 0x000000000040c96c nest::SimulationManager::call_update_()  ???:0
10 0x0000000000411129 nest::SimulationManager::run()  ???:0
11 0x00000000003f5d7d nest::run()  ???:0
12 0x00000000003f5e51 nest::simulate()  ???:0
13 0x00000000003b1836 nest::NestModule::SimulateFunction::execute()  ???:0
14 0x00000000000bac21 SLIInterpreter::execute_()  interpret.cc:0
15 0x0000000000030d04 __pyx_pw_12pynestkernel_10NESTEngine_9run()  pynestkernel.cxx:0
16 0x00000000001d5e9c _PyEval_EvalFrameDefault()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/ceval.c:5225
17 0x00000000001d5e9c _PyEval_EvalFrameDefault()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/ceval.c:5226
18 0x00000000001ce50a _PyEval_EvalFrame()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/./Include/internal/pycore_ceval.h:73
19 0x00000000001ce50a _PyEval_Vector()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/ceval.c:6443
20 0x00000000001d6c3a _PyEval_EvalFrameDefault()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/ceval.c:5380
21 0x00000000001ce50a _PyEval_EvalFrame()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/./Include/internal/pycore_ceval.h:73
22 0x00000000001ce50a _PyEval_Vector()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/ceval.c:6443
23 0x00000000002562e1 PyEval_EvalCode()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/ceval.c:1154
24 0x0000000000273443 run_eval_code_obj()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/pythonrun.c:1714
25 0x000000000026fbaa run_mod()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/pythonrun.c:1735
26 0x00000000002851e1 pyrun_file()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/pythonrun.c:1630
27 0x0000000000284054 _PyRun_SimpleFileObject()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/pythonrun.c:440
28 0x0000000000283c24 _PyRun_AnyFileObject()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Python/pythonrun.c:79
29 0x000000000027df4c pymain_run_file_obj()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Modules/main.c:360
30 0x000000000027df4c pymain_run_file()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Modules/main.c:379
31 0x000000000027df4c pymain_run_python()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Modules/main.c:601
32 0x000000000027df4c Py_RunMain()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Modules/main.c:680
33 0x0000000000246c67 Py_BytesMain()  /dev/shm/swmanage/jusuf/Python/3.11.3/GCCcore-12.3.0/Python-3.11.3/Modules/main.c:734
34 0x0000000000029590 __libc_start_call_main()  ???:0
35 0x0000000000029640 __libc_start_main_alias_2()  :0
36 0x000000000040106e _start()  ???:0
=================================
<PSP:r0000028:Backtrace after SIGSEGV (Invalid memory reference):>
<PSP:r0000028:# 0: /p/software/jusuf/stages/2024/software/pscom/5-default-GCCcore-12.3.0/lib/libpscom.so.2(+0xb4e4) [0x1529ccad14e4]>
<PSP:r0000028:# 1: /usr/lib64/libc.so.6(+0x3e6f0) [0x152a4963e6f0]>
<PSP:r0000028:# 2: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/../../../nest/libnest.so.3(_ZN4nest9ConnectorINS_14static_synapseINS_24TargetIdentifierPtrRportEEEE4sendEmmRKSt6vectorIPNS_14ConnectorModelESaIS7_EERNS_5EventE+0x87) [0x152a3bc69387]>
<PSP:r0000028:# 3: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/../../../nest/libnest.so.3(+0x43cfd6) [0x152a3ba50fd6]>
<PSP:r0000028:# 4: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/../../../nest/libnest.so.3(_ZN4nest20EventDeliveryManager14deliver_eventsEm+0x6f) [0x152a3ba5329f]>
<PSP:r0000028:# 5: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/../../../nest/libnest.so.3(+0x40abaa) [0x152a3ba1ebaa]>
<PSP:r0000028:# 6: /p/software/jusuf/stages/2024/software/GCCcore/12.3.0/lib64/libgomp.so.1(GOMP_parallel+0x46) [0x152a406b06e6]>
<PSP:r0000028:# 7: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/../../../nest/libnest.so.3(_ZN4nest17SimulationManager7update_Ev+0x197) [0x152a3ba20067]>
<PSP:r0000028:# 8: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/../../../nest/libnest.so.3(_ZN4nest17SimulationManager12call_update_Ev+0x5dc) [0x152a3ba2096c]>
<PSP:r0000028:# 9: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/../../../nest/libnest.so.3(_ZN4nest17SimulationManager3runERKNS_4TimeE+0x339) [0x152a3ba25129]>
<PSP:r0000028:#10: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/../../../nest/libnest.so.3(_ZN4nest3runERKd+0x9d) [0x152a3ba09d7d]>
<PSP:r0000028:#11: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/../../../nest/libnest.so.3(_ZN4nest8simulateERKd+0x11) [0x152a3ba09e51]>
<PSP:r0000028:#12: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/../../../nest/libnest.so.3(_ZNK4nest10NestModule16SimulateFunction7executeEP14SLIInterpreter+0x36) [0x152a3b9c5836]>
<PSP:r0000028:#13: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/../../../nest/libsli.so.3(_ZN14SLIInterpreter8execute_Em+0x201) [0x152a3b041c21]>
<PSP:r0000028:#14: /p/software/jusuf/stages/2024/software/nest-simulator/3.6-gpsmpi-2023a/lib/python3.11/site-packages/nest/pynestkernel.so(+0x30d04) [0x152a3c1cfd04]>
<PSP:r0000028:#15: /p/software/jusuf/stages/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x41bc) [0x152a49c2ce9c]>
<PSP:r0000028:#16: /p/software/jusuf/stages/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/libpython3.11.so.1.0(+0x1ce50a) [0x152a49c2550a]>
<PSP:r0000028:#17: /p/software/jusuf/stages/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x4f5a) [0x152a49c2dc3a]>
<PSP:r0000028:#18: /p/software/jusuf/stages/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/libpython3.11.so.1.0(+0x1ce50a) [0x152a49c2550a]>
<PSP:r0000028:#19: /p/software/jusuf/stages/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/libpython3.11.so.1.0(PyEval_EvalCode+0xa1) [0x152a49cad2e1]>
readFromPMIClient: lost connection to the PMI client
kvsprovider[23316]: releaseMySelf: wrong message type 3 (PSP_CD_CLIENTREFUSED)
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
readFromPMIClient: lost connection to the PMI client
srun: error: jsfc114: tasks 0-27,29-31: Terminated
srun: error: jsfc114: task 28: Exited with exit code 1
srun: Force Terminated StepId=659919.0

Desktop/Environment (please complete the following information):

  • OS: Linux 5.4.0-204-generic x86_64; HPCs (NEMO, JUSUF)
  • Shell: bash
  • Python-Version: Python 3.8.10, Python 3.9.7 :: Intel Corporation, Python 3.12.3
  • NEST-Version: v3.6, v3.8
  • Installation: using cmake with MPI support

neuroady, Jan 15 '25 13:01

@neuroady For further debugging of this issue, if it is not solved by the solution for #3489, you may want to compile NEST with the following CMake flags:

  • GCC: -Dwith-debug="-O0 -g -D_GLIBCXX_ASSERTIONS"
  • Clang:
    • -Dwith-debug="-O0 -g -fsanitize=bounds"
    • -Dwith-debug="-O0 -g -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_EXTENSIVE"

These flags add bounds checks to C++ vectors and similar containers, even where the code only uses []. The second Clang option seems useful only when PyNEST is run under the lldb debugger, but then execution stops very nicely where things go wrong.

heplesser, Jul 14 '25 06:07