trt_pose

RuntimeError during training

Open OliverGuy opened this issue 4 years ago • 3 comments

Dear Nvidia AI-IOT team,

I've been trying to train the models using the provided train.py script and the COCO dataset downloaded through the provided shell script. Right after the script prints "Saving checkpoint to densenet121_baseline_att_224x224_A.json.log/epoch_0.pth", I get the following error, at 0% progress:

  File "train.py", line 162, in <module>
    scaled_loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution (try_all at /pytorch/aten/src/ATen/native/cudnn/Conv.cpp:693)

Would you have any pointers on how to fix this?

FYI, I am running this script from a Docker container based on 10.1-devel-ubuntu18.04, with the following package/module versions:

  • CUDA 10.1
  • cuDNN 7.6.4.38-1
  • torch 1.5.1+cu101
  • torchvision 0.6.1+cu101
  • latest apex release (with the cpp and cuda extensions)

I have also checked these versions from within Python.
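For anyone reproducing this setup, here is a minimal sketch of how such a version check could be done from Python. The helper name `report_versions` is hypothetical and not part of trt_pose; the import guards are only there so the snippet also runs on a machine without PyTorch installed.

```python
import importlib.util


def report_versions():
    """Collect the library versions that Python actually sees at runtime.

    Hypothetical helper for illustration; it is not part of trt_pose.
    """
    versions = {"torch": None, "cuda": None, "cudnn": None, "torchvision": None}
    if importlib.util.find_spec("torch") is not None:
        import torch
        versions["torch"] = torch.__version__
        # CUDA version PyTorch was built against (None for CPU-only builds).
        versions["cuda"] = torch.version.cuda
        # cuDNN version as an integer, or None if cuDNN is unavailable.
        versions["cudnn"] = torch.backends.cudnn.version()
    if importlib.util.find_spec("torchvision") is not None:
        import torchvision
        versions["torchvision"] = torchvision.__version__
    return versions


versions = report_versions()
for name, value in versions.items():
    print(f"{name}: {value}")
```

Comparing this output against the versions installed in the container can help spot a mismatch between the system CUDA/cuDNN packages and the ones PyTorch was built with.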

Thanks in advance!

Oliver

OliverGuy, Jul 02 '20 16:07

Setting torch.backends.cudnn.benchmark = True at the beginning of the script fixed it for me.
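A minimal sketch of that workaround, assuming it is placed near the top of the training script before any model or convolution is created. The function name `enable_cudnn_benchmark` is hypothetical, and the import guard is only there so the snippet runs without PyTorch installed. With `benchmark` enabled, cuDNN times the available convolution algorithms for each input shape and picks the fastest; the thread does not establish why this avoided the error here.

```python
import importlib.util


def enable_cudnn_benchmark():
    """Enable cuDNN autotuning if PyTorch is installed.

    Hypothetical helper for illustration; returns True if the flag was set.
    """
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch not installed, nothing to configure
    import torch
    # Ask cuDNN to benchmark candidate convolution algorithms per input shape.
    torch.backends.cudnn.benchmark = True
    return bool(torch.backends.cudnn.benchmark)


enabled = enable_cudnn_benchmark()
print("cudnn.benchmark enabled:", enabled)
```

In a script such as train.py, the single assignment `torch.backends.cudnn.benchmark = True` right after `import torch` would have the same effect as the helper above.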

OliverGuy, Jul 09 '20 12:07

Hi OliverGuy,

Apologies for the delayed response. Thanks for sharing!

Glad to hear you got past this. Were you able to train successfully?

Best, John

jaybdub, Jul 15 '20 10:07

Hi John,

No problem at all, and thanks for checking back! I was indeed able to train, and I'm now working on building my own COCO-style dataset.

Regards, Oliver

OliverGuy, Jul 15 '20 10:07