trt_pose RuntimeError during training

Dear Nvidia AI-IOT team,

I've been trying to train the models using the provided train.py script and the COCO dataset downloaded through the provided shell script. Right after the message "Saving checkpoint to densenet121_baseline_att_224x224_A.json.log/epoch_0.pth", at 0% progress, I get the following error:

  File "train.py", line 162, in <module>
    scaled_loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution (try_all at /pytorch/aten/src/ATen/native/cudnn/Conv.cpp:693)

Would you have any pointers on how to fix this?

FYI, I am running this script from a Docker container based on 10.1-devel-ubuntu18.04, with the following package/module versions:

cuda 10.1
cuDNN 7.6.4.38-1
torch 1.5.1+cu101
torchvision 0.6.1+cu101
latest apex release (with the cpp and cuda extensions)

I have also checked these versions from Python.
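
For anyone wanting to compare their setup, here is a minimal sketch of one way to read these versions from Python. The helper name `report_versions` is hypothetical and not part of trt_pose; it only uses attributes that PyTorch itself exposes.

```python
import torch


def report_versions():
    """Collect the versions that PyTorch reports (hypothetical helper)."""
    return {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against; None on CPU-only builds.
        "cuda": torch.version.cuda,
        # cuDNN version as an integer; None if cuDNN is not available.
        "cudnn": torch.backends.cudnn.version(),
        # Whether a CUDA device is actually usable from this process.
        "cuda_available": torch.cuda.is_available(),
    }


for name, value in report_versions().items():
    print(f"{name}: {value}")
```

Note that the values reported here describe what PyTorch was built with and can see, which may differ from the system packages installed in the container.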

Thanks in advance!

Oliver

Jul 02 '20 16:07 OliverGuy

Setting torch.backends.cudnn.benchmark = True at the beginning of the script fixed it for me.
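
For anyone hitting the same error, a minimal sketch of the change. The exact placement is an assumption: the flag is set right after the imports, before the model is built or any convolution runs.

```python
import torch

# Let cuDNN benchmark the convolution algorithms available for each
# input configuration and reuse the fastest one it finds.
torch.backends.cudnn.benchmark = True

# The rest of the training script (model, optimizer, training loop)
# would follow here, unchanged.
```

This worked in the setup described above; whether it helps in other environments is not something this thread confirms.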

Jul 09 '20 12:07 OliverGuy

Hi OliverGuy,

Apologies for the delayed response. Thanks for sharing!

Glad to hear you got past this. Were you able to train successfully?

Best, John

Jul 15 '20 10:07 jaybdub

Hi John,

No problem at all, and thanks for checking back! I was indeed able to train, and I'm now working on crafting my own COCO dataset.

Regards, Oliver

Jul 15 '20 10:07 OliverGuy
