CUDA Error: initialization error

Open lpanaf opened this issue 1 year ago • 1 comments

Thanks for the nice optimization library.

I am trying to use BaSpaCho within the Theseus framework, but I run into problems when combining it with multiprocessing. For instance, if I use a DataLoader with multiple workers, or run parts of my code with multiprocessing, it raises "CUDA Error: initialization error". Switching from BaspachoSparseSolver to LUCudaSparseSolver or CholeskyDenseSolver makes the problem go away. Also, the error does not appear when I run my code in debug mode. Do you have any idea what the problem could be?

First part of my error message:

[xxx/baspacho/baspacho/baspacho/CudaDefs.h:82] CUDA Error: initialization error
*** Aborted at 1733329525 (unix time) try "date -d @1733329525" if you are using GNU date ***
PC: @     0x14bf9b2489fc pthread_kill
*** SIGABRT (@0x35e9400181b99) received by PID 1579929 (TID 0x14bf951ff640) from PID 1579929; stack trace: ***
    @     0x14bf9b24bee8 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x99ee7)
    @     0x14bf996e82ed (xxx/venv/lib/python3.11/site-packages/pycolmap.cpython-311-x86_64-linux-gnu.so+0x5842ec)
    @     0x14bf9b1f4520 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x4251f)
    @     0x14bf9b2489fc pthread_kill
    @     0x14bf9b1f4476 raise
    @     0x14bf9b1da7f3 abort
    @     0x14be27d512f1 BaSpaCho::CudaSymbolicCtx::~CudaSymbolicCtx()
    @     0x14be27c74b04 std::_Sp_counted_ptr_inplace<SymbolicDecompositionData, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
    @     0x14be27c73dc7 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
    @     0x14be27c74dcd pybind11::class_<NumericDecomposition>::dealloc(pybind11::detail::value_and_holder&)
    @     0x14bf8c4a7da3 pybind11::detail::clear_instance(_object*)
    @     0x14bf8c4a8d21 pybind11_object_dealloc

Dec 04 '24 16:12 lpanaf

Thanks a lot for the report, lpanaf, and sorry to hear about the crash. I would like to look into this. Can you please provide a small repro and the version of CUDA you're using?

I see the crash occurs in the destructor, when cudaFree is called, so it might be due to the Python interface (e.g. related to multithreading, or caused by the garbage collector's delayed execution).

Thanks, Maurizio

Jan 17 '25 15:01 maurimo
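
A possible check while putting together a repro, offered as an assumption rather than a confirmed diagnosis: on Linux, Python's multiprocessing defaults to the "fork" start method, so worker processes inherit the parent's memory, including any Python objects that wrap GPU resources. If that inheritance plays a role in the crash above, running the workers with the "spawn" start method should change the behaviour. The sketch below uses only the standard library; `worker` and `run_with_spawn` are hypothetical names, and the body of `worker` is a placeholder for the code that would build and run the solver.

```python
import multiprocessing as mp
import os


def worker(task_id):
    # Hypothetical placeholder: the real code would construct and run the
    # solver here, inside the child, instead of inheriting it from the parent.
    print(f"task {task_id} ran in process {os.getpid()}")
    return os.getpid()


def run_with_spawn(num_tasks=2):
    # Use an explicit "spawn" context rather than changing the global
    # start method; each child starts from a fresh interpreter.
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=worker, args=(i,)) for i in range(num_tasks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # An exit code of 0 means the child finished without crashing.
    return [p.exitcode for p in procs]


if __name__ == "__main__":
    print("child exit codes:", run_with_spawn())
```

For the DataLoader case, PyTorch's `DataLoader` accepts a `multiprocessing_context="spawn"` argument that selects the same start method for its workers. Whether this avoids the abort in `~CudaSymbolicCtx()` has not been verified in this thread.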