LSTM training fails on a single GPU, but not with multiple GPUs
With the latest versions of EDDL (1.2.0) and ECVL (1.1.0), I get a CUDA error when training the model on a single GPU. I have no problems when using 2 or 4 GPUs. The error occurs systematically at the beginning of the third epoch and does not seem to depend on the batch size. It also does not depend on the memory consumption parameter (“full_mem”, “mid_mem” or “low_mem”); I tried all of them. The GPU is an NVIDIA V100. With previous versions of the libraries this error did not occur (but I was using a different GPU).
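For reference, the sketch below (not the actual UC5 code) shows roughly how the computing service and memory setting are selected in pyeddl; the tiny dense model, optimizer, loss and metric are just placeholders. The full traceback follows.

```python
# Minimal sketch, NOT the actual UC5 recurrent model: it only illustrates how
# the computing service (1 GPU vs. several GPUs) and the memory setting are
# chosen. The tiny dense model, optimizer, loss and metric are placeholders.
import pyeddl.eddl as eddl

in_ = eddl.Input([784])                       # placeholder input size
out = eddl.Softmax(eddl.Dense(eddl.ReLu(eddl.Dense(in_, 128)), 10))
net = eddl.Model([in_], [out])

# Single GPU; I tried each of "full_mem", "mid_mem", "low_mem".
cs = eddl.CS_GPU([1], mem="full_mem")
# Two GPUs would be selected with a device mask, e.g. eddl.CS_GPU([1, 1], mem="full_mem").

eddl.build(
    net,
    eddl.adam(0.001),             # placeholder optimizer
    ["soft_cross_entropy"],       # placeholder loss
    ["categorical_accuracy"],     # placeholder metric
    cs,
)
```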
Traceback (most recent call last):
  File "C01_2_rec_mod_edll.py", line 98, in <module>
    fire.Fire({
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C01_2_rec_mod_edll.py", line 46, in train
    rec_mod.train()
  File "/mnt/datasets/uc5/UC5_pipeline_forked/src/eddl_lib/recurrent_module.py", line 289, in train
    eddl.train_batch(rnn, [cnn_visual, thresholded], [Y])
  File "/root/miniconda3/envs/eddl/lib/python3.8/site-packages/pyeddl/eddl.py", line 435, in train_batch
    return _eddl.train_batch(net, in_, out)
RuntimeError: [CUDA ERROR]: invalid argument (1) raised in delete_tensor | (check_cuda)
The code is not yet available in the repository; please let me know what details I can add.
Can you send a minimal script to debug it? With just this information, I'm a bit lost.
Hello @thistlillo, we have been debugging this issue but we have not been able to reproduce the problem. Our tests run past five epochs with both configurations, 1 GPU and 2 GPUs. Do you think a virtual meeting would help?
Hello @bernia, and sorry for the late reply: I did not receive any notification from GitHub about your reply. I have now installed version 1.3, and next week I will run some more tests. I will report back here.
The code published for UC5 is not up to date; it now also uses the ECVL dataloader. I work on a fork that I periodically merge back after cleaning up the code. I will also try to update the repository with clean code.
Hello, I have found the cause of the issue. It is related to the size of the last batch: when the last batch contains fewer than "batch size" items, training an LSTM-based network fails. Training does not fail when the last (partial) batch is kept while training a convolutional neural network (resnet18 in my case).
Contrary to what I said above, the LSTM training fails both when running on a single GPU and on multiple GPUs. I was able to replicate the issue with the latest versions of ECVL and EDDL, both with cuDNN enabled and without.
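Based on the partial-batch cause above, the sketch below outlines a possible workaround: drop the trailing incomplete batch so that every eddl.train_batch call receives exactly batch-size samples. The dataset size, batch size and variable names are illustrative, not taken from the UC5 code.

```python
# Sketch of a possible workaround (illustrative numbers and names, not the
# UC5 code): skip the last batch when it holds fewer than batch_size samples.
n_samples = 1030      # hypothetical dataset size
batch_size = 32

# Floor division keeps only the 32 full batches and discards the trailing
# partial batch of 1030 - 32 * 32 = 6 samples that triggers the error.
n_batches = n_samples // batch_size

for b in range(n_batches):
    start = b * batch_size
    end = start + batch_size
    # Slice the inputs/targets for [start, end) here and train on full
    # batches only, e.g.:
    # eddl.train_batch(rnn, [cnn_visual, thresholded], [Y])
    pass
```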