Exception when cleaning up ThreadLocalStore learner when using CUDA
I am using Visual Studio 17.5.3 and the C API from the release_2.0.0 branch.
I create a worker thread that fits a model with XGBoosterUpdateOneIter(), with the option XGBoosterSetParam(booster, "device", "cuda") set. When the worker thread exits, I get a dmlc::Error exception with this call stack:
xgboost.dll!dmlc::LogMessageFatal::~LogMessageFatal() Line 428 C++
xgboost.dll!dh::ThrowOnCudaError(enum cudaError,char const *,int) C++
xgboost.dll!xgboost::HostDeviceVectorImpl<float>::SetDevice(void) C++
xgboost.dll!xgboost::HostDeviceVectorImpl<float>::`scalar deleting destructor'(unsigned int) C++
xgboost.dll!xgboost::HostDeviceVector<float>::~HostDeviceVector<float>(void) C++
xgboost.dll!xgboost::XGBAPIThreadLocalEntry::~XGBAPIThreadLocalEntry() C++
xgboost.dll!std::_Tree_val<std::_Tree_simple_types<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>>>::_Erase_tree<std::allocator<std::_Tree_node<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>,void *>>>(std::allocator<std::_Tree_node<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>,void *>> & _Al, std::_Tree_node<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>,void *> * _Rootnode) Line 747 C++
[Inline Frame] xgboost.dll!std::_Tree_val<std::_Tree_simple_types<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>>>::_Erase_head(std::allocator<std::_Tree_node<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>,void *>> &) Line 754 C++
[Inline Frame] xgboost.dll!std::_Tree<std::_Tmap_traits<xgboost::Learner const *,xgboost::XGBAPIThreadLocalEntry,std::less<xgboost::Learner const *>,std::allocator<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>>,0>>::{dtor}() Line 1081 C++
xgboost.dll!`dmlc::ThreadLocalStore<std::map<xgboost::Learner const *,xgboost::XGBAPIThreadLocalEntry,std::less<xgboost::Learner const *>,std::allocator<std::pair<xgboost::Learner const * const,xgboost::XGBAPIThreadLocalEntry>>>>::Get'::`2'::`dynamic atexit destructor for 'inst''() C++
xgboost.dll!__dyn_tls_dtor(void * __formal, const unsigned long dwReason, void * __formal) Line 119 C++
I do not get this exception when I use the CPU device.
Could you please share the error message? Also, how does the worker thread exit: is the OS reclaiming the thread, or are you joining it yourself?
[09:06:21] C:\eclwork\Source\3rdParty\xgboost\xgboost\src\common\common.h:45: C:\eclwork\Source\3rdParty\xgboost\xgboost\src\common\host_device_vector.cu: 264: cudaErrorInitializationError: initialization error
It looks like calling XGBoosterPredictFromDMatrix() is what triggers this condition.
The worker thread is a normal std::thread, and the main thread calls join() on it after signaling it to return.
Functions are called in this order:
XGDMatrixCreateFromCallback() //training data Xy
XGDMatrixCreateFromCallback() //test data XyTest
The callbacks use:
XGProxyDMatrixSetDataDense()
XGDMatrixSetDenseInfo() // setting "label"
The rest of the calls:
XGBoosterCreate() // cache of Xy and XyTest
XGBoosterSetParam() //multiple calls
for () {
XGBoosterUpdateOneIter() // Xy
XGBoosterEvalOneIter()
}
XGDMatrixFree(Xy)
XGBoosterPredictFromDMatrix() //XyTest
XGDMatrixGetFloatInfo(XyTest, "label"...)
XGDMatrixFree(XyTest)
The thread returns and the main thread takes ownership of the BoosterHandle.
Unfortunately, I don't fully understand the cause yet, and I don't use Windows myself. Based on the error message, my guess is that while the thread's thread-local storage is being destroyed, the system has already torn down the CUDA runtime context, so XGBoost can no longer free its device memory.
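To make the teardown order in that guess concrete, here is a minimal standalone C++ sketch, with no XGBoost or CUDA involved. `Entry`, `local_entry()` and `run_worker()` are hypothetical names standing in for the per-thread cache seen in the stack trace. It only shows that the destructor of a `thread_local` object runs on the exiting worker thread, before `join()` returns; it does not reproduce the CUDA failure itself.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for a per-thread cache entry that owns a resource.
struct Entry {
    ~Entry() {
        // Runs on the exiting thread, as part of its thread-local teardown.
        destroyed_on = std::this_thread::get_id();
        destroyed.store(true);
    }
    void touch() {}
    static std::atomic<bool> destroyed;
    static std::thread::id destroyed_on;
};
std::atomic<bool> Entry::destroyed{false};
std::thread::id Entry::destroyed_on{};

// One instance per thread, constructed on first use in that thread.
Entry& local_entry() {
    thread_local Entry inst;
    return inst;
}

struct TeardownReport {
    bool destroyed;  // the destructor had run by the time join() returned
    bool on_worker;  // it ran on the worker thread, not on the caller
};

TeardownReport run_worker() {
    std::thread::id worker_id;
    std::thread worker([&] {
        worker_id = std::this_thread::get_id();
        local_entry().touch();  // forces construction in the worker thread
    });
    worker.join();  // thread-local destructors have completed at this point
    return {Entry::destroyed.load(), Entry::destroyed_on == worker_id};
}
```

Calling `run_worker()` from the main thread should report both flags as true. Any resource released in such a destructor therefore depends on whatever runtime state is still alive on the worker thread at exit.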
If my guess is correct, then we will have to invent a new predict function that returns a memory handle and ask the users to manage the returned prediction buffer.
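A rough sketch of what "the user manages the returned prediction buffer" could look like, in plain C++ with no XGBoost calls. Every name here (`PredictionBuffer`, `predict_into`, `predict_thread_local`, `predict_on_worker`) is hypothetical and is not part of the XGBoost C API. The constant 0.5f is a placeholder for real predictions, and host vectors stand in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Current style (hypothetical model): the result lives in thread-local
// storage, so it is destroyed when the calling thread exits.
const std::vector<float>& predict_thread_local(std::size_t n) {
    thread_local std::vector<float> result;
    result.assign(n, 0.5f);  // placeholder values, not real predictions
    return result;
}

// Proposed style (hypothetical): the caller owns the buffer and decides
// when it is released, independent of any thread's lifetime.
struct PredictionBuffer {
    std::vector<float> values;
};

void predict_into(std::size_t n, PredictionBuffer* out) {
    out->values.assign(n, 0.5f);  // placeholder values, not real predictions
}

// The worker fills a buffer owned by the caller; the buffer outlives the
// worker thread and is freed when the caller lets it go out of scope.
PredictionBuffer predict_on_worker(std::size_t n) {
    PredictionBuffer buf;
    std::thread worker([&] { predict_into(n, &buf); });
    worker.join();
    return buf;
}
```

With the caller-owned variant, nothing prediction-related is left in thread-local storage for the worker's exit to destroy, which is the point of the proposal. The cost is that the user becomes responsible for releasing the buffer.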