nrn icon indicating copy to clipboard operation
nrn copied to clipboard

ASan issues in Cvode code when run in --forked mode

Open olupton opened this issue 2 years ago • 2 comments

Context

In https://github.com/neuronsimulator/nrn/pull/1854 we have seen that several tests that used to pass now fail.

There are errors from UBSan (https://github.com/neuronsimulator/nrn/runs/6985353866?check_suite_focus=true#step:14:6443), ASan (https://github.com/neuronsimulator/nrn/runs/6985353994?check_suite_focus=true#step:14:4632), and crashes.

Overview of the issue

It's not yet clear precisely how the forking affects this. Presumably it somehow changes how sandboxed different tests are from one another.

Focusing on the test_ecs_reinit_cvode test, the issue seems to be that we call CVodeReInit https://github.com/neuronsimulator/nrn/blob/8fecd777a717fc11c22a294533aac7a4493fc273/src/nrncvode/cvodeobj.cpp#L1078 with a longer y_ vector after calling CVodeMalloc https://github.com/neuronsimulator/nrn/blob/8fecd777a717fc11c22a294533aac7a4493fc273/src/nrncvode/cvodeobj.cpp#L1096 with a shorter y_ vector, but that the cleanup in https://github.com/neuronsimulator/nrn/blob/8fecd777a717fc11c22a294533aac7a4493fc273/src/nrncvode/cvodeobj.cpp#L886-L889 is not triggered, so the internal arrays have the original, shorter, length. This causes a buffer overflow when https://github.com/neuronsimulator/nrn/blob/8fecd777a717fc11c22a294533aac7a4493fc273/src/sundials/cvodes/cvodes.c#L690 tries to copy the full length of the new/long y_ into an old/short zn[0].

The above description is of the issue flagged by ASan, it has not yet been confirmed if this is the underlying cause of the UBSan errors and crashes.

Expected result/behavior

Everything should pass, forking should not matter, there should never be any ASan errors.

NEURON setup

  • Version: https://github.com/neuronsimulator/nrn/pull/1854
  • Installation method: CMake build
  • OS + Version: BB5 for ASan debugging, crashes seen on macOS too
  • Compiler + Version: Clang 13 for ASan debugging on BB5, Clang 14 in ASan/UBSan CI, AppleClang on macOS

Minimal working example - MWE

Run rxd tests in CTest having made pytest-forked available (e.g. via pip install)

cmake -G Ninja .. \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DNRN_ENABLE_CORENEURON=ON \
  -DNRN_ENABLE_DOCS=OFF \
  -DNRN_ENABLE_MPI=ON \
  -DNRN_ENABLE_TESTS=ON \
  -DNRN_CLANG_FORMAT=ON \
  -DNRN_CMAKE_FORMAT=ON \
  -DNRN_SANITIZERS=address \
  -DCORENRN_SANITIZERS=address \
  -DCORENRN_ENABLE_MPI=ON \
  -DCORENRN_ENABLE_OPENMP=OFF \
  -DPYTHON_EXECUTABLE=$(command -v python)

cc: @ohm314

olupton avatar Jun 22 '22 16:06 olupton

As far as I can see , the problem is related to this line in Cvode::init_eqn(): https://github.com/neuronsimulator/nrn/blob/8fecd777a717fc11c22a294533aac7a4493fc273/src/nrncvode/occvode.cpp#L165 which appears to very indirectly internally call Cvode::init_eqn() again via: https://github.com/neuronsimulator/nrn/blob/8fecd777a717fc11c22a294533aac7a4493fc273/src/nrnpython/rxd.cpp#L771 and: https://github.com/neuronsimulator/nrn/blob/8fecd777a717fc11c22a294533aac7a4493fc273/src/nrnpython/rxd.cpp#L779 and: https://github.com/neuronsimulator/nrn/blob/c0e2ea9edc2a9fd530d9841f744dc55ea8fe95f4/share/lib/python/neuron/rxd/rxd.py#L593-L602 and the inner call initialises the CVODES structures using a (smaller) intermediate value of neq_ before the outer call completes the calculation of neq_ and fails to re-initialises the structures using the final, larger, value.

The difference between the --forked and "standard" configurations seems to be whether https://github.com/neuronsimulator/nrn/blob/c0e2ea9edc2a9fd530d9841f744dc55ea8fe95f4/share/lib/python/neuron/rxd/rxd.py#L595 returns True or False when the test first reaches it. With --forked, it is True. It appears that this is because https://github.com/neuronsimulator/nrn/blob/c0e2ea9edc2a9fd530d9841f744dc55ea8fe95f4/test/rxd/conftest.py#L69 is not called in the same way when --forked is used.

olupton avatar Jun 23 '22 09:06 olupton

@ramcdougal @adamjhn when nrn_nonvint_block_ode_count requests a count of the additional states that rxd is adding to the system, is it necessary to call _cvode_object.re_init()

nrnhines avatar Jun 24 '22 16:06 nrnhines