SIRF icon indicating copy to clipboard operation
SIRF copied to clipboard

SPECT reconstruction causes error with CIL GradientOperator if OPENMP=ON

Open samdporter opened this issue 3 years ago • 21 comments

Segmentation error (core dumped)

After running minimalCIL.py @ https://github.com/samdporter/SPECTissues.git

CIL's admm causes failure. OSEM reconstruction works OK.

Ran valgrind & will post in next comment

samdporter avatar May 03 '22 11:05 samdporter

==85870== 64,167,432 bytes in 126 blocks are still reachable in loss record 7,248 of 7,250 ==85870== at 0x483C583: operator new[](unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==85870== by 0x30673740: stir::ProjMatrixByBinSPECTUB::set_up(std::shared_ptr<stir::ProjDataInfo const> const&, std::shared_ptr<stir::DiscretisedDensity<3, float> const> const&) (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x30682166: stir::ProjectorByBinPair::set_up(std::shared_ptr<stir::ProjDataInfo const> const&, std::shared_ptr<stir::DiscretisedDensity<3, float> const> const&) (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x306826BC: stir::ProjectorByBinPairUsingProjMatrixByBin::set_up(std::shared_ptr<stir::ProjDataInfo const> const&, std::shared_ptr<stir::DiscretisedDensity<3, float> const> const&) (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x3059DB54: sirf::PETAcquisitionModel::set_up(std::shared_ptrsirf::PETAcquisitionData, std::shared_ptrsirf::STIRImageData) (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x3056A3F6: sirf::PETAcquisitionModelUsingMatrix::set_up(std::shared_ptrsirf::PETAcquisitionData, std::shared_ptrsirf::STIRImageData) (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x30554F40: cSTIR_setupAcquisitionModel (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x3052DE33: _wrap_cSTIR_setupAcquisitionModel (in /usr/devel/build/INSTALL/python/sirf/_pystir.so) ==85870== by 0x5F3A29: PyCFunction_Call (in /usr/bin/python3.8) ==85870== by 0x5F3E1D: _PyObject_MakeTpCall (in /usr/bin/python3.8) ==85870== by 0x570673: _PyEval_EvalFrameDefault (in /usr/bin/python3.8) ==85870== by 0x5F6835: _PyFunction_Vectorcall (in /usr/bin/python3.8) ==85870== ==85870== 113,746,225 bytes in 1 blocks are still reachable in loss record 7,249 of 7,250 ==85870== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==85870== by 0x5F8A00: PyBytes_FromStringAndSize (in /usr/bin/python3.8) ==85870== by 0x5F903B: ??? (in /usr/bin/python3.8) ==85870== by 0x646F95: ??? (in /usr/bin/python3.8) ==85870== by 0x5C4675: ??? (in /usr/bin/python3.8) ==85870== by 0x4F290D: ??? (in /usr/bin/python3.8) ==85870== by 0x64F717: ??? (in /usr/bin/python3.8) ==85870== by 0x5048B2: ??? (in /usr/bin/python3.8) ==85870== by 0x56B1D9: _PyEval_EvalFrameDefault (in /usr/bin/python3.8) ==85870== by 0x569399: _PyEval_EvalCodeWithName (in /usr/bin/python3.8) ==85870== by 0x5F6A12: _PyFunction_Vectorcall (in /usr/bin/python3.8) ==85870== by 0x56B1D9: _PyEval_EvalFrameDefault (in /usr/bin/python3.8) ==85870== ==85870== 240,458,752 bytes in 2 blocks are still reachable in loss record 7,250 of 7,250 ==85870== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==85870== by 0x5806F33: default_malloc (in /usr/local/lib/python3.8/dist-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so) ==85870== by 0x580764E: PyDataMem_UserNEW (in /usr/local/lib/python3.8/dist-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so) ==85870== by 0x58644B9: PyArray_NewFromDescr_int (in /usr/local/lib/python3.8/dist-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so) ==85870== by 0x580ABB7: array_new (in /usr/local/lib/python3.8/dist-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so) ==85870== by 0x5F3D02: _PyObject_MakeTpCall (in /usr/bin/python3.8) ==85870== by 0x570AF8: _PyEval_EvalFrameDefault (in /usr/bin/python3.8) ==85870== by 0x569399: _PyEval_EvalCodeWithName (in /usr/bin/python3.8) ==85870== by 0x5F6A12: _PyFunction_Vectorcall (in /usr/bin/python3.8) ==85870== by 0x56C28B: _PyEval_EvalFrameDefault (in /usr/bin/python3.8) ==85870== by 0x5F6835: _PyFunction_Vectorcall (in /usr/bin/python3.8) ==85870== by 0x59DDFE: ??? (in /usr/bin/python3.8) ==85870== ==85870== LEAK SUMMARY: ==85870== definitely lost: 1,816 bytes in 15 blocks ==85870== indirectly lost: 0 bytes in 0 blocks ==85870== possibly lost: 699,830 bytes in 629 blocks ==85870== still reachable: 514,605,355 bytes in 80,204 blocks ==85870== of which reachable via heuristic: ==85870== stdstring : 25,582 bytes in 681 blocks ==85870== newarray : 248,504 bytes in 247 blocks ==85870== suppressed: 0 bytes in 0 blocks ==85870== ==85870== ERROR SUMMARY: 78258 errors from 554 contexts (suppressed: 0 from 0)

Possible memory leak?

samdporter avatar May 03 '22 11:05 samdporter

results from gbd:

Thread 34 "python3" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fc781aad700 (LWP 261805)] __gnu_cxx::__atomic_add_dispatch (__val=1, __mem=0x249) at /usr/include/c++/9/ext/atomicity.h:96 96 __atomic_add(__mem, __val);

#0 __gnu_cxx::__atomic_add_dispatch (__val=1, __mem=0x219) at /usr/include/c++/9/ext/atomicity.h:96 #1 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy (this=0x211) at /usr/include/c++/9/bits/shared_ptr_base.h:139 #2 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::operator= (__r=..., this=) at /usr/include/c++/9/bits/shared_ptr_base.h:747 #3 std::__shared_ptr<stir::DiscretisedDensity<3, float>, (__gnu_cxx::_Lock_policy)2>::operator= (this=) at /usr/include/c++/9/bits/shared_ptr_base.h:1080 #4 std::shared_ptr<stir::DiscretisedDensity<3, float> >::operator= (this=) at /usr/include/c++/9/bits/shared_ptr.h:103 #5 stir::BackProjectorByBin::actual_back_project (this=0x3a66aa0, viewgrams=..., min_axial_pos_num=0, max_axial_pos_num=0, min_tangential_pos_num=-62, max_tangential_pos_num=62) at /usr/devel/build/sources/STIR/src/recon_buildblock/BackProjectorByBin.cxx:387 #6 0x00007f48e4c7b141 in stir::BackProjectorByBin::back_project (this=0x3a66aa0, viewgrams=..., min_axial_pos_num=0, max_axial_pos_num=0, min_tangential_pos_num=-62, max_tangential_pos_num=62) at /usr/devel/build/sources/STIR/src/recon_buildblock/BackProjectorByBin.cxx:299 #7 0x00007f48e4c7c5cc in stir::BackProjectorByBin::_ZN4stir18BackProjectorByBin12back_projectERKNS_8ProjDataEii._omp_fn.0(void) () at /usr/devel/build/sources/STIR/src/recon_buildblock/BackProjectorByBin.cxx:223 #8 0x00007f48f643b78e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1 #9 0x00007f490daa6609 in start_thread (arg=) at pthread_create.c:477 #10 0x00007f490dbe0163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

$ (gdb) up #1 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy (this=0x211) at /usr/include/c++/9/bits/shared_ptr_base.h:139 139 { __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); } $ (gdb) up #2 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::operator= (__r=..., this=) at /usr/include/c++/9/bits/shared_ptr_base.h:747 747 __tmp->_M_add_ref_copy(); $ (gdb) down #1 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy (this=0x211) at /usr/include/c++/9/bits/shared_ptr_base.h:139 139 { __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); } $ (gdb) down #0 __gnu_cxx::__atomic_add_dispatch (__val=1, __mem=0x219) at /usr/include/c++/9/ext/atomicity.h:96 96 __atomic_add(__mem, __val); $ (gdb) down Bottom (innermost) frame selected; you cannot go down. $ (gdb) list 91 attribute ((unused)) 92 __atomic_add_dispatch(_Atomic_word* __mem, int __val) 93 { 94 #ifdef __GTHREADS 95 if (__gthread_active_p()) 96 __atomic_add(__mem, __val); 97 else 98 __atomic_add_single(__mem, __val); 99 #else 100 __atomic_add_single(__mem, __val);

samdporter avatar May 03 '22 13:05 samdporter

hmmm. Potentially multi-threading stuff. could be a red-herring. Nevertheless, can you please compile STIR without openMP

cd ..../builds/STIR/build
cmake -DSTIR_OPENMP:BOOL=OFF .
make -j6 install
cd ../../SIRF/build
make -j6 install

KrisThielemans avatar May 03 '22 14:05 KrisThielemans

your valgrind log is from a build without debug info, so we don't have any line info. Can you re-run? Probably best to attach as a log file though, not paste the actual content.

KrisThielemans avatar May 03 '22 14:05 KrisThielemans

ideally re-run before building non-OEPNMP version.

KrisThielemans avatar May 03 '22 14:05 KrisThielemans

valgrind log

This is the valgrind output with the debugger on (I've checked it's on in cmake, but it looks the same to me)

samdporter avatar May 10 '22 13:05 samdporter

Looks like it is an issue with multi-threading. Everything runs fine with openMP off

samdporter avatar May 10 '22 13:05 samdporter

Hi, The problem seems to arise after creating a CIL gradient operator with a C backend see: mimimal.ipynb The final cell fails

The algorithm runs fine when using Numpy. Should I raise this as a CIL issue?

samdporter avatar May 12 '22 15:05 samdporter

good detective work! Pretty weird though. Does this then depend on SPECT at all? i.e. can you just save an image first, then load it in a new kernel and do the CIL operation?

tagging @epapoutsellis @gfardell @paskino

KrisThielemans avatar May 12 '22 15:05 KrisThielemans

minimal.ipynb does not work. It runs ok with backend=numpy in the GradientOperator?

epapoutsellis avatar May 12 '22 16:05 epapoutsellis

Yes exactly

samdporter avatar May 12 '22 16:05 samdporter

GradientOperator with C backend expects contiguous data as in a numpy ndarray or a float* C array. I think it gets passed a SIRF object with which it doesn't have any clue what to do. So, I think this is expected. Use backend='numpy'.

The link you provide to minimal.ipynb returns a 404.

paskino avatar May 13 '22 10:05 paskino

Sorry the repository was private. Try now

samdporter avatar May 13 '22 10:05 samdporter

GradientOperator with C backend expects contiguous data as in a numpy ndarray or a float* C array. I think it gets passed a SIRF object with which it doesn't have any clue what to do. So, I think this is expected. Use backend='numpy'.

The link you provide to minimal.ipynb returns a 404.

Why would this be different for PET?

samdporter avatar May 13 '22 10:05 samdporter

Just to understand, the last cell contains

### fails ###
tmp = am.forward(fdg)

where am is defined above as am = STIR.AcquisitionModelUsingMatrix(acq_model_matrix). As far as I can understand this has nothing to do with the cell above GradientOperator

grad = operators.GradientOperator(fdg, method='backward')

where you probably have also unveiled a bug as the method is not saved into the instance and therefore not used. IMHO it is should even fail .

paskino avatar May 13 '22 11:05 paskino

grad = operators.GradientOperator(fdg, method='backward') This actually runs because method = 'backward' uses the Numpy backend. I've update the notebook with a c-backend grad operator that does fail (apologies) I agree that it has nothing to do with the am operator. It's very confusing. In the notebook I was trying to show all the things that you can do with no problems and then the final two cells cause a problem (if that makes sense). The last cell runs with a GradientOperator numpy backend in the cell above and not with a GradientOperator C backend

samdporter avatar May 13 '22 11:05 samdporter

no, the method is passed over via the kwargs.

paskino avatar May 13 '22 11:05 paskino

What's the situation with this?

Do we have a SPECT problem, a CIL problem, or we don't know?

@samdporter, did you set keep_all_views_in_cache? If not, can you set it to true? This should enable multi-threading for SPECT (and avoid the horrible/dangerous work-around in https://github.com/UCL/STIR/blob/25b65645f9b532ed5129333c096a25faab8ae07a/src/recon_buildblock/ProjMatrixByBinSPECTUB.cxx#L244-L248). Note that this work-around shouldn't do anything if we set_num_threads(1) , at least for STIR. I have no idea that what omp_set_num_threads does when called in STIR, but openmp is used in CIL. @gfardell any idea?

KrisThielemans avatar May 22 '22 08:05 KrisThielemans

Latest update from @samdporter on discord is that PET_SPECT_Synergestic works provided the CIL GradientOperator uses the numpy backend.

epapoutsellis avatar May 22 '22 18:05 epapoutsellis

Hi, update: Setting up and running two SPECT reconstructions as in https://github.com/samdporter/SPECTissues/blob/main/OSEM.py results in error:

terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
Aborted (core dumped)

I'm having trouble building SIRF with Debug info but will comment when I am able to. I believe that I have all the relevant updates to SPECTUBMatrix on my local build

samdporter avatar Jul 04 '22 08:07 samdporter

we need the text of the exception. Presumably it was redirected to a file?

KrisThielemans avatar Jul 04 '22 17:07 KrisThielemans