MSRF-Net_PyTorch

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED when calling loss.backward() in train.py

Open chiendoanngoc opened this issue 3 years ago • 3 comments

Thanks for your great work. Your code is so clean that I could easily understand it. I just hit an error in train.py at loss.backward(): `RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED`. Have you seen this before, and do you have any suggestions for fixing it? Thanks a lot!

chiendoanngoc avatar Jan 08 '22 10:01 chiendoanngoc

Hi @chiendoanngoc! You're welcome! I faced the same issue and fixed it by switching to another version of PyTorch. I'm currently using version 1.9.0+cu111, but the right build depends on your CUDA version. You can find all previous PyTorch versions here.

I just changed the README file to avoid confusion about the PyTorch version.
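
The `+cu111` suffix in a version string such as `1.9.0+cu111` names the CUDA release the wheel was built against (here 11.1). A minimal sketch of decoding it, assuming the suffix keeps the `cu<major><minor>` form with a single-digit minor version; `cuda_from_torch_version` is a hypothetical helper, not part of PyTorch:

```python
import re
from typing import Optional


def cuda_from_torch_version(version: str) -> Optional[str]:
    """Return the CUDA release encoded in a PyTorch version string, or None."""
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return None  # e.g. "1.9.0+cpu" or a plain "1.9.0"
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major.
    return f"{digits[:-1]}.{digits[-1]}"


print(cuda_from_torch_version("1.9.0+cu111"))  # 11.1
print(cuda_from_torch_version("1.9.0+cpu"))    # None
```

Passing `torch.__version__` to this helper gives the CUDA release to compare against what the machine provides.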

amlarraz avatar Jan 10 '22 16:01 amlarraz

Hello @amlarraz @chiendoanngoc, I changed the torch version to 1.9.0+cu111 but I still get the same error. I'm using Colab as my working environment.

```
  cpuset_checked))
Logdir: ./logs/combination-2_7_2022-18h40m33s
Train epoch: 1:   0%|          | 0/1113 [00:00<?, ?it/s]/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-13d8a8766d4b> in <module>()
     60         loss = criterion(pred_3, pred_canny, pred_1, pred_2, msk, canny_label)
     61         loss = loss/accumulation_steps
---> 62         loss.backward()
     63         # accumulative gradient
     64         if (i + 1) % accumulation_steps == 0:  # Wait for several backward steps

1 frames
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    147     Variable._execution_engine.run_backward(
    148         tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 149         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    150 
    151 

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
```

`import torch; torch.__version__`

`1.9.0+cu111`
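
The version string alone does not show which cuDNN the build loads or whether a GPU is visible. A minimal sketch of a fuller report, using only `torch.version.cuda`, `torch.backends.cudnn.version()` and `torch.cuda.is_available()`; `describe_torch_build` is a hypothetical helper name, and the function returns empty values if PyTorch is not installed:

```python
def describe_torch_build() -> dict:
    """Collect version details of the installed PyTorch build, if any."""
    info = {"torch": None, "built_for_cuda": None, "cudnn": None, "gpu_visible": False}
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed in this environment

    info["torch"] = str(torch.__version__)
    info["built_for_cuda"] = torch.version.cuda  # None on CPU-only builds
    info["gpu_visible"] = bool(torch.cuda.is_available())
    try:
        # Integer such as 8005, or None if cuDNN is unavailable
        info["cudnn"] = torch.backends.cudnn.version()
    except RuntimeError as err:
        # Keep the message instead of crashing the report
        info["cudnn"] = f"error: {err}"
    return info


for key, value in describe_torch_build().items():
    print(f"{key}: {value}")
```

Posting this output alongside the traceback makes it easier to see whether the build and the runtime agree.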

mk-hassan avatar Jul 02 '22 18:07 mk-hassan

Hi @Twixii99, which CUDA version are you using? Remember that the PyTorch build you need depends on your CUDA version. If you're using this PyTorch version and the Colab environment is using a CUDA version other than 11.1, PyTorch will give you errors. To find out which CUDA version you're using, run the command `!nvidia-smi` in a cell. To choose the correct PyTorch version for your CUDA version, you can visit this page.
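
The comparison described above can be sketched in Python. Note that `nvidia-smi` reports the highest CUDA version the installed driver supports, so the check below only asks whether the driver is at least as new as the wheel's CUDA release, which is a rough, conservative test. Both helper names are hypothetical, and the sample header line is illustrative rather than taken from this thread:

```python
import re
import shutil
import subprocess
from typing import Optional


def driver_cuda_version(smi_output: str) -> Optional[str]:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_output)
    return match.group(1) if match else None


def wheel_is_supported(wheel_cuda: str, driver_cuda: str) -> bool:
    """True if the driver's CUDA version is at least the wheel's."""
    def as_tuple(version: str) -> tuple:
        return tuple(int(part) for part in version.split("."))
    return as_tuple(driver_cuda) >= as_tuple(wheel_cuda)


# Illustrative header line in the format nvidia-smi prints (assumed values).
sample = "| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |"
print(wheel_is_supported("11.1", driver_cuda_version(sample)))

# On a machine with an NVIDIA driver, read the real value instead.
if shutil.which("nvidia-smi"):
    output = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    print("Driver supports CUDA:", driver_cuda_version(output))
```

If the check fails, installing a PyTorch build for an older CUDA release is the adjustment suggested in this thread.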

amlarraz avatar Jul 04 '22 07:07 amlarraz