trt_pose
RuntimeError during training
Dear Nvidia AI-IOT team,
I've been trying to train the models using the provided train.py script and the COCO dataset downloaded through the provided shell script.
Right at "Saving checkpoint to densenet121_baseline_att_224x224_A.json.log/epoch_0.pth", with training still at 0% progress, I get the following error:
  File "train.py", line 162, in <module>
    scaled_loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution (try_all at /pytorch/aten/src/ATen/native/cudnn/Conv.cpp:693)
Would you have any pointers on how to fix this?
FYI, I am running this script from a Docker container based on 10.1-devel-ubuntu18.04, with the following package/module versions:
- CUDA: 10.1
- cuDNN: 7.6.4.38-1
- torch: 1.5.1+cu101
- torchvision: 0.6.1+cu101
- apex: latest release (with the cpp and cuda extensions)
I have also checked these versions from Python.
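For anyone comparing environments, here is a minimal sketch of how the versions above can be read from Python. The helper name `report_versions` is only illustrative and is not part of trt_pose; the calls it wraps are standard PyTorch attributes.

```python
import torch


def report_versions():
    """Collect the versions that PyTorch itself reports (hypothetical helper)."""
    return {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against; None on CPU-only builds.
        "cuda": torch.version.cuda,
        # cuDNN version as an integer (7.6.4 is reported as 7604); None if unavailable.
        "cudnn": torch.backends.cudnn.version(),
        # Whether a CUDA device is actually usable from this process.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for name, value in report_versions().items():
        print(f"{name}: {value}")
```

Running this inside the container shows what PyTorch sees at runtime, which can differ from the system packages when a wheel bundles its own CUDA and cuDNN libraries.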
Thanks in advance!
Oliver
Setting torch.backends.cudnn.benchmark = True at the beginning of the script fixed it for me.
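In case it helps others, the change amounts to something like the sketch below. The exact placement inside train.py is an assumption; the only requirement I relied on is that the flag is set before the first convolution runs.

```python
import torch

# Enable the cuDNN autotuner: for each new convolution configuration,
# cuDNN benchmarks the available algorithms and caches the fastest one
# that works, instead of relying on its default heuristic choice.
torch.backends.cudnn.benchmark = True

# ... the rest of the training script (model, optimizer, training loop)
# would follow here, unchanged.
```

Benchmark mode suits this training setup because the input size is fixed (224x224), so the autotuning cost is paid once per layer shape. With frequently changing input sizes it can slow things down, since each new shape triggers another round of benchmarking.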
Hi OliverGuy,
Apologies for the delayed response. Thanks for sharing!
Glad to hear you got past this. Were you able to train successfully?
Best, John
Hi John,
No problem at all, and thanks for checking back! I was indeed able to train, and I'm now working on crafting my own COCO dataset.
Regards, Oliver