LSTM training fails on a single GPU, but not with multiple GPUs
With the latest versions of EDDL (1.2.0) and ECVL (1.1.0), I get a CUDA error when training the model on a single GPU. I have no problems when using 2 or 4 GPUs. The error occurs systematically at the beginning of the third epoch and does not seem to depend on the batch size. It also does not depend on the memory consumption parameter (“full_mem”, “mid_mem” or “low_mem”); I tried all of them. The GPU is an NVIDIA V100. With previous versions of the libraries this error did not occur (but I was using a different GPU).
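For reference, the sketch below (not the actual UC5 code) shows roughly how the computing service and memory setting are selected in pyeddl; the tiny dense model, optimizer, loss and metric are just placeholders. The full traceback follows.

```python
# Minimal sketch, NOT the actual UC5 recurrent model: it only illustrates how
# the computing service (1 GPU vs. several GPUs) and the memory setting are
# chosen. The tiny dense model, optimizer, loss and metric are placeholders.
import pyeddl.eddl as eddl

in_ = eddl.Input([784])                       # placeholder input size
out = eddl.Softmax(eddl.Dense(eddl.ReLu(eddl.Dense(in_, 128)), 10))
net = eddl.Model([in_], [out])

# Single GPU; I tried each of "full_mem", "mid_mem", "low_mem".
cs = eddl.CS_GPU([1], mem="full_mem")
# Two GPUs would be selected with a device mask, e.g. eddl.CS_GPU([1, 1], mem="full_mem").

eddl.build(
    net,
    eddl.adam(0.001),             # placeholder optimizer
    ["soft_cross_entropy"],       # placeholder loss
    ["categorical_accuracy"],     # placeholder metric
    cs,
)
```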
Traceback (most recent call last):
  File "C01_2_rec_mod_edll.py", line 98, in <module>
    fire.Fire({
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C01_2_rec_mod_edll.py", line 46, in train
    rec_mod.train()
  File "/mnt/datasets/uc5/UC5_pipeline_forked/src/eddl_lib/recurrent_module.py", line 289, in train
    eddl.train_batch(rnn, [cnn_visual, thresholded], [Y])
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/pyeddl/eddl.py", line 435, in train_batch
    return _eddl.train_batch(net, in_, out)
RuntimeError: [CUDA ERROR]: invalid argument (1) raised in delete_tensor | (check_cuda)
The code is not yet available in the repository; please let me know what details I can add.
Can you send a minimal script to debug it? With just this information, I'm a bit lost.
Hello @thistlillo, we have been debugging this issue but we have not been able to reproduce the problem. Our tests run past five epochs with both configurations, 1 GPU and 2 GPUs. Do you think a virtual meeting would help?
Hello @bernia, and sorry for the late reply: I did not receive any notification from GitHub about your reply. I have now installed version 1.3, and next week I will run some more tests. I will report back here.
The code published for UC5 is not up to date; it now also uses the ECVL dataloader. I work on a fork that I periodically merge back after cleaning up the code. I will also try to update the repository with clean code.
Hello, I have found the cause of the issue. It is related to the size of the last batch: when the last batch contains fewer than "batch size" items, training an LSTM-based network fails. Training does not fail when the last (partial) batch is kept while training a convolutional neural network (resnet18 in my case).
Contrary to what I said above, the LSTM training fails both when running on a single GPU and on multiple GPUs. I was able to replicate the issue with the latest versions of ECVL and EDDL, both with cuDNN enabled and without.
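Based on the partial-batch cause above, the sketch below outlines a possible workaround: drop the trailing incomplete batch so that every eddl.train_batch call receives exactly batch-size samples. The dataset size, batch size and variable names are illustrative, not taken from the UC5 code.

```python
# Sketch of a possible workaround (illustrative numbers and names, not the
# UC5 code): skip the last batch when it holds fewer than batch_size samples.
n_samples = 1030      # hypothetical dataset size
batch_size = 32

# Floor division keeps only the 32 full batches and discards the trailing
# partial batch of 1030 - 32 * 32 = 6 samples that triggers the error.
n_batches = n_samples // batch_size

for b in range(n_batches):
    start = b * batch_size
    end = start + batch_size
    # Slice the inputs/targets for [start, end) here and train on full
    # batches only, e.g.:
    # eddl.train_batch(rnn, [cnn_visual, thresholded], [Y])
    pass
```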