SIRF SPECT reconstruction causes error with CIL GradientOperator if OPENMP=ON

Segmentation error (core dumped)

After running minimalCIL.py @ https://github.com/samdporter/SPECTissues.git

CIL's admm causes failure. OSEM reconstruction works OK.

Ran valgrind & will post in next comment

May 03 '22 11:05 samdporter

==85870== 64,167,432 bytes in 126 blocks are still reachable in loss record 7,248 of 7,250 ==85870== at 0x483C583: operator new[](unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==85870== by 0x30673740: stir::ProjMatrixByBinSPECTUB::set_up(std::shared_ptr<stir::ProjDataInfo const> const&, std::shared_ptr<stir::DiscretisedDensity<3, float> const> const&) (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x30682166: stir::ProjectorByBinPair::set_up(std::shared_ptr<stir::ProjDataInfo const> const&, std::shared_ptr<stir::DiscretisedDensity<3, float> const> const&) (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x306826BC: stir::ProjectorByBinPairUsingProjMatrixByBin::set_up(std::shared_ptr<stir::ProjDataInfo const> const&, std::shared_ptr<stir::DiscretisedDensity<3, float> const> const&) (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x3059DB54: sirf::PETAcquisitionModel::set_up(std::shared_ptrsirf::PETAcquisitionData, std::shared_ptrsirf::STIRImageData) (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x3056A3F6: sirf::PETAcquisitionModelUsingMatrix::set_up(std::shared_ptrsirf::PETAcquisitionData, std::shared_ptrsirf::STIRImageData) (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x30554F40: cSTIR_setupAcquisitionModel (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x3052DE33: _wrap_cSTIR_setupAcquisitionModel (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x5F3A29: PyCFunction_Call (in /usr/bin/python3.8) ==85870== by 0x5F3E1D: _PyObject_MakeTpCall (in /usr/bin/python3.8) ==85870== by 0x570673: _PyEval_EvalFrameDefault (in /usr/bin/python3.8) ==85870== by 0x5F6835: _PyFunction_Vectorcall (in /usr/bin/python3.8) ==85870== ==85870== 113,746,225 bytes in 1 blocks are still reachable in loss record 7,249 of 7,250 ==85870== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==85870== by 0x5F8A00: PyBytes_FromStringAndSize (in /usr/bin/python3.8) ==85870== by 0x5F903B: ??? (in /usr/bin/python3.8) ==85870== by 0x646F95: ??? (in /usr/bin/python3.8) ==85870== by 0x5C4675: ??? (in /usr/bin/python3.8) ==85870== by 0x4F290D: ??? (in /usr/bin/python3.8) ==85870== by 0x64F717: ??? (in /usr/bin/python3.8) ==85870== by 0x5048B2: ??? (in /usr/bin/python3.8) ==85870== by 0x56B1D9: _PyEval_EvalFrameDefault (in /usr/bin/python3.8) ==85870== by 0x569399: _PyEval_EvalCodeWithName (in /usr/bin/python3.8) ==85870== by 0x5F6A12: _PyFunction_Vectorcall (in /usr/bin/python3.8) ==85870== by 0x56B1D9: _PyEval_EvalFrameDefault (in /usr/bin/python3.8) ==85870== ==85870== 240,458,752 bytes in 2 blocks are still reachable in loss record 7,250 of 7,250 ==85870== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==85870== by 0x5806F33: default_malloc (in /usr/local/lib/python3.8/dist-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so) ==85870== by 0x580764E: PyDataMem_UserNEW (in /usr/local/lib/python3.8/dist-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so) ==85870== by 0x58644B9: PyArray_NewFromDescr_int (in /usr/local/lib/python3.8/dist-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so) ==85870== by 0x580ABB7: array_new (in /usr/local/lib/python3.8/dist-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so) ==85870== by 0x5F3D02: _PyObject_MakeTpCall (in /usr/bin/python3.8) ==85870== by 0x570AF8: _PyEval_EvalFrameDefault (in /usr/bin/python3.8) ==85870== by 0x569399: _PyEval_EvalCodeWithName (in /usr/bin/python3.8) ==85870== by 0x5F6A12: _PyFunction_Vectorcall (in /usr/bin/python3.8) ==85870== by 0x56C28B: _PyEval_EvalFrameDefault (in /usr/bin/python3.8) ==85870== by 0x5F6835: _PyFunction_Vectorcall (in /usr/bin/python3.8) ==85870== by 0x59DDFE: ??? (in /usr/bin/python3.8) ==85870== ==85870== LEAK SUMMARY: ==85870== definitely lost: 1,816 bytes in 15 blocks ==85870== indirectly lost: 0 bytes in 0 blocks ==85870== possibly lost: 699,830 bytes in 629 blocks ==85870== still reachable: 514,605,355 bytes in 80,204 blocks ==85870== of which reachable via heuristic: ==85870== stdstring : 25,582 bytes in 681 blocks ==85870== newarray : 248,504 bytes in 247 blocks ==85870== suppressed: 0 bytes in 0 blocks ==85870== ==85870== ERROR SUMMARY: 78258 errors from 554 contexts (suppressed: 0 from 0)

Possible memory leak?

May 03 '22 11:05 samdporter

results from gbd:

Thread 34 "python3" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fc781aad700 (LWP 261805)] __gnu_cxx::__atomic_add_dispatch (__val=1, __mem=0x249) at /usr/include/c++/9/ext/atomicity.h:96 96 __atomic_add(__mem, __val);

#0 __gnu_cxx::__atomic_add_dispatch (__val=1, __mem=0x219) at /usr/include/c++/9/ext/atomicity.h:96 #1 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy (this=0x211) at /usr/include/c++/9/bits/shared_ptr_base.h:139 #2 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::operator= (__r=..., this=) at /usr/include/c++/9/bits/shared_ptr_base.h:747 #3 std::__shared_ptr<stir::DiscretisedDensity<3, float>, (__gnu_cxx::_Lock_policy)2>::operator= (this=) at /usr/include/c++/9/bits/shared_ptr_base.h:1080 #4 std::shared_ptr<stir::DiscretisedDensity<3, float> >::operator= (this=) at /usr/include/c++/9/bits/shared_ptr.h:103 #5 stir::BackProjectorByBin::actual_back_project (this=0x3a66aa0, viewgrams=..., min_axial_pos_num=0, max_axial_pos_num=0, min_tangential_pos_num=-62, max_tangential_pos_num=62) at /usr/devel/build/sources/STIR/src/recon_buildblock/BackProjectorByBin.cxx:387 #6 0x00007f48e4c7b141 in stir::BackProjectorByBin::back_project (this=0x3a66aa0, viewgrams=..., min_axial_pos_num=0, max_axial_pos_num=0, min_tangential_pos_num=-62, max_tangential_pos_num=62) at /usr/devel/build/sources/STIR/src/recon_buildblock/BackProjectorByBin.cxx:299 #7 0x00007f48e4c7c5cc in stir::BackProjectorByBin::_ZN4stir18BackProjectorByBin12back_projectERKNS_8ProjDataEii._omp_fn.0(void) () at /usr/devel/build/sources/STIR/src/recon_buildblock/BackProjectorByBin.cxx:223 #8 0x00007f48f643b78e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1 #9 0x00007f490daa6609 in start_thread (arg=) at pthread_create.c:477 #10 0x00007f490dbe0163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

$ (gdb) up #1 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy (this=0x211) at /usr/include/c++/9/bits/shared_ptr_base.h:139 139 { __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); } $ (gdb) up #2 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::operator= (__r=..., this=) at /usr/include/c++/9/bits/shared_ptr_base.h:747 747 __tmp->_M_add_ref_copy(); $ (gdb) down #1 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy (this=0x211) at /usr/include/c++/9/bits/shared_ptr_base.h:139 139 { __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); } $ (gdb) down #0 __gnu_cxx::__atomic_add_dispatch (__val=1, __mem=0x219) at /usr/include/c++/9/ext/atomicity.h:96 96 __atomic_add(__mem, __val); $ (gdb) down Bottom (innermost) frame selected; you cannot go down. $ (gdb) list 91 attribute ((unused)) 92 __atomic_add_dispatch(_Atomic_word* __mem, int __val) 93 { 94 #ifdef __GTHREADS 95 if (__gthread_active_p()) 96 __atomic_add(__mem, __val); 97 else 98 __atomic_add_single(__mem, __val); 99 #else 100 __atomic_add_single(__mem, __val);

May 03 '22 13:05 samdporter

hmmm. Potentially multi-threading stuff. could be a red-herring. Nevertheless, can you please compile STIR without openMP

cd ..../builds/STIR/build
cmake -DSTIR_OPENMP:BOOL=OFF .
make -j6 install
cd ../../SIRF/build
make -j6 install

May 03 '22 14:05 KrisThielemans

your valgrind log is from a build without debug info, so we don't have any line info. Can you re-run? Probably best to attach as a log file though, not paste the actual content.

May 03 '22 14:05 KrisThielemans

ideally re-run before building non-OEPNMP version.

May 03 '22 14:05 KrisThielemans

valgrind log

This is the valgrind output with the debugger on (I've checked it's on in cmake, but it looks the same to me)

May 10 '22 13:05 samdporter

Looks like it is an issue with multi-threading. Everything runs fine with openMP off

May 10 '22 13:05 samdporter

Hi, The problem seems to arise after creating a CIL gradient operator with a C backend see: mimimal.ipynb The final cell fails

The algorithm runs fine when using Numpy. Should I raise this as a CIL issue?

May 12 '22 15:05 samdporter

good detective work! Pretty weird though. Does this then depend on SPECT at all? i.e. can you just save an image first, then load it in a new kernel and do the CIL operation?

tagging @epapoutsellis @gfardell @paskino

May 12 '22 15:05 KrisThielemans

minimal.ipynb does not work. It runs ok with backend=numpy in the GradientOperator?

May 12 '22 16:05 epapoutsellis

Yes exactly

May 12 '22 16:05 samdporter

GradientOperator with C backend expects contiguous data as in a numpy ndarray or a float* C array. I think it gets passed a SIRF object with which it doesn't have any clue what to do. So, I think this is expected. Use backend='numpy'.

The link you provide to minimal.ipynb returns a 404.

May 13 '22 10:05 paskino

Sorry the repository was private. Try now

May 13 '22 10:05 samdporter

GradientOperator with C backend expects contiguous data as in a numpy ndarray or a float* C array. I think it gets passed a SIRF object with which it doesn't have any clue what to do. So, I think this is expected. Use backend='numpy'.

The link you provide to minimal.ipynb returns a 404.

Why would this be different for PET?

May 13 '22 10:05 samdporter

Just to understand, the last cell contains

### fails ###
tmp = am.forward(fdg)

where am is defined above as am = STIR.AcquisitionModelUsingMatrix(acq_model_matrix). As far as I can understand this has nothing to do with the cell above GradientOperator

grad = operators.GradientOperator(fdg, method='backward')

where you probably have also unveiled a bug as the method is not saved into the instance and therefore not used. IMHO it is should even fail .

May 13 '22 11:05 paskino

grad = operators.GradientOperator(fdg, method='backward') This actually runs because method = 'backward' uses the Numpy backend. I've update the notebook with a c-backend grad operator that does fail (apologies) I agree that it has nothing to do with the am operator. It's very confusing. In the notebook I was trying to show all the things that you can do with no problems and then the final two cells cause a problem (if that makes sense). The last cell runs with a GradientOperator numpy backend in the cell above and not with a GradientOperator C backend

May 13 '22 11:05 samdporter

no, the method is passed over via the kwargs.

May 13 '22 11:05 paskino

What's the situation with this?

Do we have a SPECT problem, a CIL problem, or we don't know?

@samdporter, did you set keep_all_views_in_cache? If not, can you set it to true? This should enable multi-threading for SPECT (and avoid the horrible/dangerous work-around in https://github.com/UCL/STIR/blob/25b65645f9b532ed5129333c096a25faab8ae07a/src/recon_buildblock/ProjMatrixByBinSPECTUB.cxx#L244-L248). Note that this work-around shouldn't do anything if we set_num_threads(1) , at least for STIR. I have no idea that what omp_set_num_threads does when called in STIR, but openmp is used in CIL. @gfardell any idea?

May 22 '22 08:05 KrisThielemans

Latest update from @samdporter on discord is that PET_SPECT_Synergestic works provided the CIL GradientOperator uses the numpy backend.

May 22 '22 18:05 epapoutsellis

Hi, update: Setting up and running two SPECT reconstructions as in https://github.com/samdporter/SPECTissues/blob/main/OSEM.py results in error:

terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
Aborted (core dumped)

I'm having trouble building SIRF with Debug info but will comment when I am able to. I believe that I have all the relevant updates to SPECTUBMatrix on my local build

Jul 04 '22 08:07 samdporter

we need the text of the exception. Presumably it was redirected to a file?

Jul 04 '22 17:07 KrisThielemans