Importing nest built with mpi without mpiexec appears to cause segfault (PyNEST-NG)

Open heplesser opened this issue 11 months ago • 4 comments

When building the PyNEST-NG variant of NEST with MPI support, importing nest appears to lead to segfaults on Linux, see e.g., https://github.com/heplesser/nest-simulator/actions/runs/13130033031/job/36633256531#step:23:213. Invocation under control of mpiexec works. The problem does not occur on macOS.

I have so far observed this only in the testsuite. We need to understand what is going on and hopefully find a solution or at least a work-around.

I mark this as an "Enhancement", not a bug, because it is related to the PyNEST-NG under development.

heplesser avatar Feb 07 '25 11:02 heplesser

Hey @heplesser, I had a look at the issue. I was able to reproduce the error and also took a look at the core dump associated with the segfault.

  • To reproduce, just run:

    pytest -v $simple_file_just_import_nest.py

    This might cause a segfault.

#5 __strlen_avx2 ()
#6 0x00007ab6bbfa50a5 in opal_argv_join () from /lib/x86_64-linux-gnu/libopen-pal.so.40
#7 0x00007ab6bcdba7a2 in ompi_mpi_init () from /lib/x86_64-linux-gnu/libmpi.so.40
#8 0x00007ab6bcd50eec in PMPI_Init_thread () from /lib/x86_64-linux-gnu/libmpi.so.40
#9 0x00007ab6a448d078 in nest::MPIManager::init_mpi (this=0x56388d70bc60, argc=argc@entry=0x7fff39e5a904, argv=argv@entry=0x7fff39e5a908)

I checked the MPI source code and the implementation of opal_argv_join: the function takes a pointer to argv and a delimiter, and iterates over argv starting from position 1 until it reaches a nullptr.

However, the new init function in pynest/nestkernel_api.pyx does not append a nullptr at the end of argv, so the iteration runs past the end of the array and reads uninitialized memory (undefined behavior).

I don't know the use case of the new init function, but maybe one should take llapi_init_nest as a reference when adjusting the newly implemented function.
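To illustrate the point about the missing terminator, here is a minimal sketch in plain Python with ctypes. It is not the code from pynest/nestkernel_api.pyx or from Open MPI; the helper names make_c_argv and join_until_null are hypothetical and exist only for this example. The first helper builds a C-style argv with one extra slot holding NULL, and the second mimics a consumer that walks the array until it finds that NULL, which is the kind of loop that would run off the end if the terminator were missing.

```python
import ctypes


def make_c_argv(args):
    """Build a NULL-terminated char*[] from a list of Python strings.

    Hypothetical helper for illustration only; not part of NEST.
    """
    encoded = [a.encode("utf-8") for a in args]
    # Allocate one slot more than the number of arguments:
    # the last slot holds the terminating NULL pointer.
    argv = (ctypes.c_char_p * (len(encoded) + 1))()
    for i, value in enumerate(encoded):
        argv[i] = value
    # ctypes zero-initializes arrays, but set the terminator explicitly
    # to make the intent visible.
    argv[len(encoded)] = None
    return argv


def join_until_null(argv, delimiter=" "):
    """Join entries of argv until the NULL terminator is reached.

    Mimics a consumer that relies on the terminator instead of argc.
    Without the terminator, the equivalent C loop would read past the
    end of the array.
    """
    parts = []
    i = 0
    while argv[i] is not None:
        parts.append(argv[i].decode("utf-8"))
        i += 1
    return delimiter.join(parts)
```

With this layout, an argument list of length n always yields an array of length n + 1, so a consumer that ignores argc and scans for NULL still stops at the right place.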

med-ayssar avatar Feb 10 '25 12:02 med-ayssar

@med-ayssar Thanks for your detective work, that could have led to nasty consequences! Do you want to create a PR against my heplesser/pynest-ng-adac branch?

heplesser avatar Feb 10 '25 13:02 heplesser

Yes, I can do that, but could you please explain to me why the new init function is needed? Could we not just use the old implementation again?

med-ayssar avatar Feb 10 '25 13:02 med-ayssar

@heplesser, Done ✅

med-ayssar avatar Feb 10 '25 16:02 med-ayssar

Fixed by https://github.com/heplesser/nest-simulator/pull/30, thus closing.

heplesser avatar Jul 10 '25 20:07 heplesser