
CUDA Error: initialization error

Open · lpanaf opened this issue 1 year ago · 1 comment

Thanks for the nice optimization library.

I am now trying to use BaSpaCho within the Theseus framework. However, I run into problems when combining it with multiprocessing: if I use a DataLoader with multiple workers, or run parts of my code with multiprocessing, it raises "CUDA Error: initialization error". I have checked that simply switching from BaspachoSparseSolver to LUCudaSparseSolver or CholeskyDenseSolver makes the problem go away. Also, the error does not appear when I run my code in debug mode. Do you have any idea what the problem could be?
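Roughly, the multiprocessing part of my setup looks like the sketch below (placeholder dataset, sizes, and names, not my actual code); the Theseus optimization with BaspachoSparseSolver runs in the main process while the DataLoader forks its workers:

```python
# Rough sketch of the setup (placeholder names and data, not the actual code).
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(128, 6))                # placeholder data
loader = DataLoader(dataset, batch_size=16, num_workers=4)  # forked worker processes on Linux

for (batch,) in loader:
    batch = batch.cuda()
    # ... build a th.Objective from the batch, wrap it in a th.TheseusLayer
    # whose optimizer uses linear_solver_cls=th.BaspachoSparseSolver,
    # and call its forward() here ...
```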

The first part of my error message:

[xxx/baspacho/baspacho/baspacho/CudaDefs.h:82] CUDA Error: initialization error
*** Aborted at 1733329525 (unix time) try "date -d @1733329525" if you are using GNU date ***
PC: @     0x14bf9b2489fc pthread_kill
*** SIGABRT (@0x35e9400181b99) received by PID 1579929 (TID 0x14bf951ff640) from PID 1579929; stack trace: ***
    @     0x14bf9b24bee8 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x99ee7)
    @     0x14bf996e82ed (xxx/venv/lib/python3.11/site-packages/pycolmap.cpython-311-x86_64-linux-gnu.so+0x5842ec)
    @     0x14bf9b1f4520 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x4251f)
    @     0x14bf9b2489fc pthread_kill
    @     0x14bf9b1f4476 raise
    @     0x14bf9b1da7f3 abort
    @     0x14be27d512f1 BaSpaCho::CudaSymbolicCtx::~CudaSymbolicCtx()
    @     0x14be27c74b04 std::_Sp_counted_ptr_inplace<SymbolicDecompositionData, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
    @     0x14be27c73dc7 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
    @     0x14be27c74dcd pybind11::class_<NumericDecomposition>::dealloc(pybind11::detail::value_and_holder&)
    @     0x14bf8c4a7da3 pybind11::detail::clear_instance(_object*)
    @     0x14bf8c4a8d21 pybind11_object_dealloc

lpanaf · Dec 04 '24 16:12

Thanks a lot for the report lpanaf, and sorry to hear about the crash. I would like to look into this; could you please provide a small repro and the CUDA version you are using?

I see the crash occurs in the destructor when a cudaFree is called, so it might be caused by the Python interface (e.g. related to multithreading, or to the garbage collector's delayed execution).
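One generic pattern that produces exactly this error with fork-based multiprocessing (just a sketch of the general CUDA pitfall, not a confirmed diagnosis of your setup): a CUDA context created in the parent process is not valid after fork(), so any CUDA call made in a forked child, including the cudaFree issued when a decomposition object is garbage-collected there, can fail with an initialization error.

```python
# Generic CUDA-with-fork pitfall (pure PyTorch, no BaSpaCho involved): once the
# parent has initialized CUDA, a forked child cannot use it, and CUDA calls in
# the child fail with an initialization error.
import multiprocessing as mp
import torch

def child():
    torch.zeros(1, device="cuda")  # fails in a forked child

if __name__ == "__main__":
    torch.zeros(1, device="cuda")           # initialize CUDA in the parent
    mp.set_start_method("fork", force=True)  # the default on Linux
    p = mp.Process(target=child)
    p.start()
    p.join()                                # child fails with a CUDA init error
    # Using mp.set_start_method("spawn") instead avoids the problem, as does
    # DataLoader(..., multiprocessing_context="spawn").
```

If that matches your setup, the "spawn" start method may be a workaround on your side, but a small repro would still help me check the BaSpaCho side.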

Thanks, Maurizio

maurimo · Jan 17 '25 15:01