
cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Open franciscorubin opened this issue 6 years ago • 11 comments

I get the following error every time I try to do a forward call with apex:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-20-c83117740453> in <module>
      1 #%%pixie_debugger
      2 while True:
----> 3     train(verbose=False, optimize_memory=True, optimize_feature=False)
      4     with open('temp/memory.pkl', 'wb') as f:
      5         pickle.dump(net.memory_model.memory, f)

<ipython-input-19-7e6a3b51254d> in train(verbose, optimize_memory, optimize_feature)
     11         optimizer_both.zero_grad()
     12 
---> 13         similarities = net(batch_data)
     14 
     15         values, indices = similarities.max(1)

~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

<ipython-input-13-fa199304f042> in forward(self, images)
     23         queries = self.feature_model(images)
     24         #print(queries)
---> 25         similarities = self.memory_model(queries)
     26 #        print(sorted(similarities, reverse=True))
     27         return similarities

~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

<ipython-input-12-ea8dad5c6180> in forward(self, queries)
     44 
     45     def forward(self, queries):
---> 46         sim_vector = self.get_similarity_vectors(queries)
     47         return sim_vector

<ipython-input-12-ea8dad5c6180> in get_similarity_vectors(self, queries)
     39 
     40     def get_similarity_vectors(self, queries):
---> 41         similarity = self.apply_combined(queries, self.memory, self.head_model)
     42 #        print(similarity)
     43         return nn.functional.log_softmax(similarity * 10000) # multiply because of rounding errors

<ipython-input-12-ea8dad5c6180> in apply_combined(self, x, y, func)
     34         assert x.shape == y.shape
     35 
---> 36         res = func(x, y)
     37         res = res.view(n, m)
     38         return res

~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/Projects/Personal/Kaggle/humpwin/pancho111203/siamese/model.py in forward(self, x, y)
    131         out = nn.functional.relu(out, inplace=True)
    132         out = out.permute((0, 3, 1, 2))
--> 133         out = self.conv2(out)
    134         out = out.view(batch_size, n_features)
    135 

~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py in forward(self, input)
    318     def forward(self, input):
    319         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320                         self.padding, self.dilation, self.groups)
    321 
    322 

~/miniconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py in wrapper(*args, **kwargs)
     24                                      args,
     25                                      kwargs)
---> 26         return orig_fn(*new_args, **kwargs)
     27     return wrapper
     28 

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

CUDNN logs: https://gist.github.com/pancho111203/3e91f0b46ab0be3b04f1edc9c1405684

franciscorubin avatar Dec 31 '18 18:12 franciscorubin

This might be a cudnn issue, especially if you're using cudnn 7.2. Try

>>> import torch
>>> torch.backends.cudnn.version()

Upgrading your cudnn version may fix it: https://github.com/NVIDIA/apex/issues/78#issuecomment-440301134

Container options are

  • our Pytorch containers from NGC (which come with Apex preinstalled)
  • the upstream devel Dockerfiles, e.g. docker pull pytorch/pytorch:nightly-devel-cuda10.0-cudnn7 (in which you can install Apex yourself with the usual git clone followed by python setup.py install --cuda_ext --cpp_ext).

mcarilli avatar Jan 10 '19 23:01 mcarilli

I tried updating and unfortunately the error persists. The command you mentioned outputs 7401.
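
For anyone reading along, the integer returned by torch.backends.cudnn.version() packs the cuDNN release into one number. The sketch below decodes it, assuming the older encoding major*1000 + minor*100 + patch that the values in this thread (7401, 7602) follow; newer cuDNN releases may use a different encoding. The helper name decode_cudnn_version is made up for illustration, and the PyTorch part only runs if torch is installed.

```python
def decode_cudnn_version(version: int) -> tuple:
    """Split an integer such as 7401 into (major, minor, patch).

    Assumption: the older encoding major*1000 + minor*100 + patch,
    which matches the values reported in this thread.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


print(decode_cudnn_version(7401))  # (7, 4, 1)

# Optional: decode whatever the local PyTorch build reports.
try:
    import torch
except ImportError:
    torch = None

if torch is not None:
    reported = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    if reported is not None:
        print("cuDNN seen by PyTorch:", decode_cudnn_version(reported))
```

So 7401 corresponds to cuDNN 7.4.1, which is newer than the 7.2 release suspected above.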

franciscorubin avatar Jan 11 '19 21:01 franciscorubin

Just having a similar issue:

    318     def forward(self, input):
    319         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320                         self.padding, self.dilation, self.groups)
    321 
    322 

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I'm running on Windows 10 using cuDNN 7.4.2 + CUDA 10. Are the others having this problem running on Windows or on Linux?

P.S.: I am using an NVIDIA TITAN RTX.

njean78 avatar Jan 17 '19 08:01 njean78

@njean78 I am running Ubuntu 16.04 (Linux), so it looks like the error is OS-independent.

franciscorubin avatar Jan 17 '19 09:01 franciscorubin

Solved my issue by installing PyTorch for CUDA 10 (got it from https://pytorch.org/). I was probably using the one for CUDA 9...

njean78 avatar Jan 17 '19 09:01 njean78

> I tried updating and unfortunately the error persists. The command you mentioned outputs 7401.

@pancho111203 Since you've got cuda 10 on bare metal (meaning your system has the cuda 10 driver) you should be using Pytorch for cuda 10. When you say "I tried updating" do you mean you only updated cudnn, or did you try running in one of the cuda 10 containers I mentioned?
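
A quick way to see which CUDA version the installed PyTorch build targets, and which GPU it is talking to, might look like the sketch below. The function name cuda_build_report is hypothetical; the torch attributes it reads are standard ones. It does not query the driver version itself, so compare its output against what nvidia-smi reports on your machine.

```python
def cuda_build_report() -> dict:
    """Collect the CUDA details of the installed PyTorch build, if any."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        # Device 0 is relative to CUDA_VISIBLE_DEVICES, if that is set.
        report["device_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


for key, value in cuda_build_report().items():
    print(f"{key}: {value}")
```

If built_for_cuda shows a 9.x release while the card is a Turing GPU such as the TITAN RTX mentioned above, that mismatch is a plausible cause, since Turing support arrived with CUDA 10.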

mcarilli avatar Jan 28 '19 20:01 mcarilli

If you are running PyTorch in Docker, you should be aware of this: https://github.com/NVIDIA/tacotron2/issues/109

moyans avatar Aug 07 '19 03:08 moyans

  • Ubuntu 16.04
  • cuda 10
  • pytorch 1.3 (preview)
  • cudnn version : 7602

Still having this problem: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Can anyone give me some help? Thanks a lot!

zhixuanli avatar Sep 29 '19 08:09 zhixuanli

Actually, this only happens on GPU card 3; it works fine on the other GPU cards.

I only use one GPU at a time.

zhixuanli avatar Sep 29 '19 08:09 zhixuanli

@zhixuanli Which GPUs are you using and do you have a reproducible code snippet? Was apex installed successfully?
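
As a starting point for such a snippet, the sketch below runs one small convolution on every visible GPU and records which devices fail. It is only an illustration, not taken from anyone's code in this thread: the function name conv_smoke_test and the layer and tensor sizes are arbitrary choices, and it does not involve apex at all, so it only helps to tell a faulty card or driver setup apart from a problem in the model code.

```python
def conv_smoke_test() -> dict:
    """Run one small Conv2d forward pass on every visible GPU.

    Returns a mapping from device index to "ok" or the error message.
    Returns an empty dict if PyTorch is missing or no GPU is visible.
    """
    try:
        import torch
    except ImportError:
        return {}

    results = {}
    for index in range(torch.cuda.device_count()):
        device = torch.device("cuda", index)
        try:
            conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
            batch = torch.randn(4, 3, 32, 32, device=device)
            conv(batch)
            # CUDA work is asynchronous; synchronize so errors surface here.
            torch.cuda.synchronize(device)
            results[index] = "ok"
        except RuntimeError as exc:
            results[index] = str(exc)
    return results


for index, status in conv_smoke_test().items():
    print(f"cuda:{index} -> {status}")
```

Device indices here are relative to CUDA_VISIBLE_DEVICES, so "card 3" on the machine may show up under a different index if that variable is set.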

ptrblck avatar Sep 30 '19 14:09 ptrblck

Check the file path. It worked for me.

larifreitas avatar May 20 '22 16:05 larifreitas