
Better handling of the engine destruction

Open ptrendx opened this issue 5 years ago • 13 comments

Description

Currently the engine is a singleton whose lifetime is controlled by a static shared_ptr. This means that the engine is destroyed at program exit at an unspecified time (depending, for example, on the linking order). This was the cause of issue #19360, which was worked around by #19378. As noted in https://github.com/apache/incubator-mxnet/issues/19360#issuecomment-712361751, the real solution should be better handling of the lifetime of the engine. Once that is implemented, #19378 should be reverted.
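For illustration, here is a minimal C++ sketch of the pattern described above (the names are modeled on the engine interface, but this is not the actual MXNet implementation): a singleton owned by a function-local static shared_ptr, whose destructor runs during exit-time static destruction at a point that is not ordered with respect to other libraries' teardown (for example CUDA/cuDNN state).

#include <memory>

class Engine {
 public:
  ~Engine() {
    // Hypothetical cleanup: join worker threads and destroy per-device
    // streams/handles. This is only safe while the driver/cuDNN libraries
    // are still alive, which is not guaranteed during static destruction.
  }
  static std::shared_ptr<Engine> _GetSharedRef() {
    // The singleton is kept alive by a static shared_ptr; it is destroyed
    // at program exit at an unspecified point relative to other statics.
    static std::shared_ptr<Engine> sptr = std::make_shared<Engine>();
    return sptr;
  }
  static Engine* Get() { return _GetSharedRef().get(); }
};

int main() {
  Engine::Get();  // first use creates the singleton
  return 0;       // ~Engine() runs later, during exit-time teardown
}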

@szha @mseth10

ptrendx avatar Oct 19 '20 20:10 ptrendx

Looks like some MXNet users are blocked by this issue. Do you have any ideas on how we could better handle the engine destruction and fix the issue? @ptrendx @TristonC @szha

waytrue17 avatar Mar 25 '22 00:03 waytrue17

Adding @DickJC123 to the discussion too.

TristonC avatar Mar 25 '22 21:03 TristonC

This is more important now that #19378 was reverted. I am seeing the segfault described in #19360:

(gdb) bt
#0  0x00007efccaa7d277 in raise () from /lib64/libc.so.6
#1  0x00007efccaa7e968 in abort () from /lib64/libc.so.6
#2  0x00007efccaabfd37 in __libc_message () from /lib64/libc.so.6
#3  0x00007efccaac8499 in _int_free () from /lib64/libc.so.6
#4  0x00007efccaa80c00 in __run_exit_handlers () from /lib64/libc.so.6
#5  0x00007efccaa80c27 in exit () from /lib64/libc.so.6
#6  0x00007efc5531808d in ?? () from <snip>/mxnet/lib/python3.9/site-packages/mxnet/libmxnet.so
#7  <signal handler called>
#8  0x00007efbc1836061 in ?? () from /opt/apps/cudnn/8.2.4_cuda10.2/lib64/libcudnn_ops_infer.so.8
#9  0x00007efbc1861c00 in ?? () from /opt/apps/cudnn/8.2.4_cuda10.2/lib64/libcudnn_ops_infer.so.8
#10 0x00007efbc0fa7edf in cudnnDestroy () from /opt/apps/cudnn/8.2.4_cuda10.2/lib64/libcudnn_ops_infer.so.8
#11 0x00007efc552619b6 in mshadow::Stream<mshadow::gpu>::DestroyDnnHandle() () from <snip>/mxnet/lib/python3.9/site-packages/mxnet/libmxnet.so
#12 0x00007efc55261b78 in void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*) () from <snip>/mxnet/lib/python3.9/site-packages/mxnet/libmxnet.so
#13 0x00007efc55276607 in void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) () from <snip>/mxnet/lib/python3.9/site-packages/mxnet/libmxnet.so
#14 0x00007efc5527683e in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>) ()
   from <snip>/mxnet/lib/python3.9/site-packages/mxnet/libmxnet.so
#15 0x00007efc5527381b in std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() ()
   from <snip>/mxnet/lib/python3.9/site-packages/mxnet/libmxnet.so
#16 0x00007efc84cba2bd in std::execute_native_thread_routine_compat (__p=<optimized out>)
    at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
#17 0x00007efccae1be25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007efccab45bad in clone () from /lib64/libc.so.6
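For context, the sketch below is a distilled, self-contained illustration (not MXNet code) of the class of failure the trace above points at: the process is already inside exit(), so exit handlers and static destructors can tear down library state while an engine worker thread is still using it to clean up per-device resources (here, the cudnnDestroy call in frame #10).

#include <cstdlib>
#include <thread>

struct FakeHandle { bool alive = true; };        // stands in for library state such as a cuDNN handle
static FakeHandle* g_handle = new FakeHandle();

void destroy_library_state() {
  // Runs from exit() via atexit, analogous to exit-time teardown of library state.
  delete g_handle;
  g_handle = nullptr;
}

int main() {
  std::atexit(destroy_library_state);
  std::thread worker([] {
    // Simulates an engine worker still cleaning up per-device resources.
    for (int i = 0; i < 1000000; ++i) {
      if (g_handle) g_handle->alive = !g_handle->alive;  // races with teardown during exit(): use-after-free
    }
  });
  worker.detach();  // the worker is not joined before exit, like the engine's GPU worker threads
  return 0;         // exit() runs the atexit handler while the worker may still be touching g_handle
}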

joelnn avatar Jun 02 '22 20:06 joelnn

+1 on what @joelnn said. We are trying to upgrade MXNet from 1.8.0 to 1.9.1, but we are blocked because #19378 was reverted. Is there a reason why #19378 was reverted? Will an alternative fix be proposed soon?

junyuc25 avatar Jun 24 '22 15:06 junyuc25

Wanted to see if there is any follow-up or suggested workaround on this ticket. I am unable to upgrade cuDNN from 7.6.5 to 8.4.1 because MXNet segfaults during cleanup when distributed training ends. Looking for any workarounds: can we use LD_PRELOAD to change the linking order, or anything else to circumvent this issue?

gkang2018 avatar Feb 14 '23 18:02 gkang2018

@DickJC123 and @ptrendx are checking this issue. @gkang2018, to make it clear, you were trying to make MXNet 1.9.1 work with cuDNN 8.4.1, correct?

TristonC avatar Feb 14 '23 19:02 TristonC

Yes, and I also see this issue with cuDNN 8.6, which is the new version I want to upgrade to. Do we have any ETA from @DickJC123 or @ptrendx?

gkang2018 avatar Feb 14 '23 19:02 gkang2018

Not yet. But we should know soon.

TristonC avatar Feb 14 '23 19:02 TristonC

@gkang2018 Which CUDA version did you use?

TristonC avatar Feb 14 '23 19:02 TristonC

Right now we are using CUDA 10.2

gkang2018 avatar Feb 14 '23 19:02 gkang2018

Could you try CUDA 11.something? There was a change in 11.2 I believe that should help here.

ptrendx avatar Feb 14 '23 19:02 ptrendx

I can try to use a different CUDA version. However, for my production use case we are stuck on 10.2 until we can upgrade to 11.7, so I'd love to see a workaround for 10.2. But let me try it out on a higher CUDA version just to confirm.

gkang2018 avatar Feb 14 '23 19:02 gkang2018

@gkang2018 Any update on testing with CUDA 11.7?

TristonC avatar Feb 16 '23 18:02 TristonC