deadlock when training torch model
Hi @dfalbel,
I'm having an issue where model training just hangs. This doesn't always happen, but it has now happened twice in the last week. I'm using torch 0.10 on Ubuntu 20.
I'm attaching backtraces generated with gdb (thread apply all bt full) after attaching to the hanging process.
Any idea what could be going on or how to debug further?
Hi @egillax,
Is the process running torch forked at some point? Forking is usually not safe when using LibTorch; if you're doing parallel work, it's better to use multi-process parallelization. If you really need forking, you must make sure you don't use autograd in the main process (the one that will be forked) before forking, otherwise bad things can happen, including deadlocks like this one.
There's some discussion here: https://github.com/mlverse/torch/issues/971
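The general failure mode is easy to show outside of torch entirely. Below is a minimal sketch in plain C++/POSIX (nothing LibTorch-specific; the mutex just stands in for an internal library lock such as the allocator's): a background thread holds a lock at the moment of the fork, and since only the forking thread survives in the child, the copied lock is never released there.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <sys/wait.h>
#include <unistd.h>

std::mutex lib_mutex;  // stands in for an internal library lock (e.g. an allocator mutex)

int main() {
  // A worker thread (think: an autograd/background thread) takes the lock.
  std::thread worker([] {
    std::lock_guard<std::mutex> lk(lib_mutex);
    std::this_thread::sleep_for(std::chrono::seconds(5));
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(100));  // let it grab the lock

  pid_t pid = fork();
  if (pid == 0) {
    // Child process: the worker thread does not exist here, but the mutex was
    // copied in its locked state, so this blocks forever.
    std::puts("child: trying to take the library lock...");
    lib_mutex.lock();
    std::puts("child: never printed");
    _exit(0);
  }

  worker.join();
  waitpid(pid, nullptr, 0);  // never returns: the child is deadlocked
}
```

LibTorch has many such locks and background threads (autograd workers, the CUDA caching allocator, etc.), which is why multi-process parallelization is the safer option.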
Hi @dfalbel,
No, there shouldn't be any forking; there should be only one process running the model training. I did notice, though, that for the first deadlock there was another process from an older rsession occupying GPU memory. For the later deadlock there was only one process using the GPU, so I'm not sure that's relevant.
This is weird! It's interesting to see some JVM symbols in the backtrace. Do you know where they could come from?
E.g.:
#7 0x00007fece5a97406 in ?? () from /usr/lib/jvm/java-11-openjdk-amd64/lib/server/libjvm.so
It seems that the deadlock situation is very similar to what we saw in #971: autograd is running and tries to allocate more memory, which is not possible because the GPU is at its max, so it then tries to clean up some memory by calling the GC, which in turn tries to release some memory but gets deadlocked.
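Stripped of all the torch machinery, the wait cycle looks roughly like this (an illustrative sketch only, with made-up names; none of this is the actual libtorch or lantern code):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex allocator_mutex;        // stands in for the CUDA caching allocator lock
std::mutex gc_state_mutex;
std::condition_variable gc_done_cv;
bool gc_done = false;

// "Main R thread": the garbage collector frees tensors, and freeing a CUDA
// tensor needs the allocator lock.
void run_gc() {
  std::lock_guard<std::mutex> lk(allocator_mutex);   // blocks forever: the other thread holds it
  {
    std::lock_guard<std::mutex> state(gc_state_mutex);
    gc_done = true;
  }
  gc_done_cv.notify_one();
}

int main() {
  // "Autograd thread": an allocation fails while the allocator lock is held,
  // so it requests a GC and waits for it to finish.
  std::thread autograd([] {
    std::lock_guard<std::mutex> lk(allocator_mutex);
    std::thread r_main(run_gc);                       // ask the other thread to run the GC
    std::unique_lock<std::mutex> state(gc_state_mutex);
    gc_done_cv.wait(state, [] { return gc_done; });   // waits forever: the GC needs allocator_mutex
    r_main.join();
  });
  autograd.join();  // never returns -> the hang seen in the backtraces
}
```

Each thread is waiting on something only the other one can provide, which is exactly the shape of the backtraces below.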
Looking at the traceback to understand what's happening:
- During autograd, a free memory callback is called and thus the delete tasks event loop starts running:
#5 0x00007f36286bd604 in EventLoop<void>::run() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
No symbol table info available.
#6 0x00007f36286bc565 in wait_for_gc() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
- A call to the R garbage collector is requested, and we see it gets called:
R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
No symbol table info available.
#11 0x00007fed64976562 in _lantern_Tensor_delete () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
No symbol table info available.
#12 0x00007fed65763ad9 in lantern_Tensor_delete (x=0x561f72906d00) at ../inst/include/lantern/lantern.h:316
No locals.
#13 delete_tensor (x=0x561f72906d00) at torch_api.cpp:143
However, this call is in the main thread, which can't acquire the lock to delete tensors, so the deletion should be rescheduled to the autograd thread (the one that is running the delete_tasks event loop). This reschedule should happen because of
https://github.com/mlverse/torch/blob/8a3b5b3f5da44c3254cef0eb48c948a7298a5a2d/src/lantern/src/Delete.cpp#L15C1-L24
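The intent of that code is roughly the following. This is a self-contained sketch with stand-in names (DeleteTasks, run_pending, etc. are illustrative, not the actual lantern API): deletions requested on the main thread while the event loop is active get queued and executed by the loop's thread instead.

```cpp
#include <atomic>
#include <functional>
#include <mutex>
#include <queue>

// Stand-in for the delete-tasks event loop.
struct DeleteTasks {
  std::atomic<bool> running{false};
  std::mutex m;
  std::queue<std::function<void()>> tasks;

  bool is_running() const { return running.load(); }

  void push(std::function<void()> task) {
    std::lock_guard<std::mutex> lk(m);
    tasks.push(std::move(task));
  }

  // Called by the thread that owns the loop, e.g. while it waits for the GC.
  void run_pending() {
    std::lock_guard<std::mutex> lk(m);
    while (!tasks.empty()) {
      tasks.front()();
      tasks.pop();
    }
  }
};

DeleteTasks delete_tasks;

// What the deleter is supposed to do: never touch the (possibly held)
// allocator lock from the main thread while an event loop is active.
void delete_tensor(int* x) {  // int* stands in for the real tensor pointer
  if (delete_tasks.is_running()) {
    delete_tasks.push([x] { delete x; });  // defer to the event-loop thread
  } else {
    // No loop running, so deleting in place is assumed to be safe. If
    // is_running() wrongly reports false while another thread holds the
    // allocator lock, we end up in exactly the deadlock described above.
    delete x;
  }
}

int main() {
  int* t = new int(0);
  delete_tasks.running = true;  // pretend the loop is active on another thread
  delete_tensor(t);             // queued rather than deleted here
  delete_tasks.run_pending();   // the loop thread eventually performs the deletion
}
```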
For some reason, though, it seems that delete_tasks.is_running returns false, so the deletion happens in the main thread and deadlocks:
__PRETTY_FUNCTION__ = "__pthread_mutex_lock"
id = <optimized out>
#2 0x00007fed63581f54 in c10::cuda::CUDACachingAllocator::raw_delete(void*) () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/libc10_cuda.so
No symbol table info available.
#3 0x00007fed63b37998 in c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::reset_() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/libc10.so
No symbol table info available.
#4 0x00007fed63b310ef in c10::TensorImpl::~TensorImpl() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/libc10.so
No symbol table info available.
#5 0x00007fed63b311b9 in c10::TensorImpl::~TensorImpl() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/libc10.so
No symbol table info available.
#6 0x00007fed648cdf3c in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
No symbol table info available.
#7 0x00007fed648c9662 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
No symbol table info available.
#8 0x00007fed6486ecea in at::TensorBase::~TensorBase() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
No symbol table info available.
#9 0x00007fed6486f162 in at::Tensor::~Tensor() () from /home/efridgeirsson/StrategusModules/DeepPatientLevelPredictionModule_0.0.8/renv/library/R-4.1/x86_64-pc-linux-gnu/torch/lib/liblantern.so
I'm not sure where the JVM stuff is coming from. There are packages earlier in my pipeline that use Java to connect to a database and fetch the data.
I'm now trying to reproduce the issue in a simpler setting. Originally this happened on a server running my full pipeline. Now I'm running only the affected code on the server with the same data, and separately on my laptop with fake data.
I just ran into this again when running the affected code segment manually, so now I'm sure there is no forking happening anywhere. I've attached the gdb backtrace in case it helps. Is there anything else I can do to get more info about this?
@egillax, thanks for the backtrace! Is the code running into this problem public? I'll try to reproduce it with a minimal example, but it would be nice to take a look at the code to see if I can spot other clues. So far, it seems that it's caused by a GC call during a backward() call on the GPU. But if that were the only cause, I'd expect this to happen much more often, since backward allocates a lot of memory and is very likely to trigger the GC.
Yes, the code is public. The main training loop is in this class, which is instantiated and fit during hyperparameter tuning here. The issue has happened with both a ResNet and a Transformer.
I'm also trying to make a more minimal example I can share; I'll post it here if I manage.