apex
cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I get the following error every time I try to do a forward call with apex:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-20-c83117740453> in <module>
1 #%%pixie_debugger
2 while True:
----> 3 train(verbose=False, optimize_memory=True, optimize_feature=False)
4 with open('temp/memory.pkl', 'wb') as f:
5 pickle.dump(net.memory_model.memory, f)
<ipython-input-19-7e6a3b51254d> in train(verbose, optimize_memory, optimize_feature)
11 optimizer_both.zero_grad()
12
---> 13 similarities = net(batch_data)
14
15 values, indices = similarities.max(1)
~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
<ipython-input-13-fa199304f042> in forward(self, images)
23 queries = self.feature_model(images)
24 #print(queries)
---> 25 similarities = self.memory_model(queries)
26 # print(sorted(similarities, reverse=True))
27 return similarities
~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
<ipython-input-12-ea8dad5c6180> in forward(self, queries)
44
45 def forward(self, queries):
---> 46 sim_vector = self.get_similarity_vectors(queries)
47 return sim_vector
<ipython-input-12-ea8dad5c6180> in get_similarity_vectors(self, queries)
39
40 def get_similarity_vectors(self, queries):
---> 41 similarity = self.apply_combined(queries, self.memory, self.head_model)
42 # print(similarity)
43 return nn.functional.log_softmax(similarity * 10000) # multiply because of rounding errors
<ipython-input-12-ea8dad5c6180> in apply_combined(self, x, y, func)
34 assert x.shape == y.shape
35
---> 36 res = func(x, y)
37 res = res.view(n, m)
38 return res
~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
~/Projects/Personal/Kaggle/humpwin/pancho111203/siamese/model.py in forward(self, x, y)
131 out = nn.functional.relu(out, inplace=True)
132 out = out.permute((0, 3, 1, 2))
--> 133 out = self.conv2(out)
134 out = out.view(batch_size, n_features)
135
~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py in forward(self, input)
318 def forward(self, input):
319 return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320 self.padding, self.dilation, self.groups)
321
322
~/miniconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py in wrapper(*args, **kwargs)
24 args,
25 kwargs)
---> 26 return orig_fn(*new_args, **kwargs)
27 return wrapper
28
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
CUDNN logs: https://gist.github.com/pancho111203/3e91f0b46ab0be3b04f1edc9c1405684
This might be a cudnn issue, especially if you're using cudnn 7.2. Try
>>> import torch
>>> torch.backends.cudnn.version()
Upgrading your cudnn version may fix it: https://github.com/NVIDIA/apex/issues/78#issuecomment-440301134
Container options are
- our Pytorch containers from NGC (which come with Apex preinstalled)
- the upstream devel Dockerfiles, e.g. `docker pull pytorch/pytorch:nightly-devel-cuda10.0-cudnn7` (in which you can install Apex yourself with the usual `git clone`, `python setup.py install --cuda_ext --cpp_ext`).
I tried updating and unfortunately the error persists. The command you mentioned outputs 7401.
Just having a similar issue:
318 def forward(self, input):
319 return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320 self.padding, self.dilation, self.groups)
321
322
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I'm running on Windows 10 using cuDNN 7.4.2 + CUDA 10. Are the others having this problem running on Windows or on Linux?
P.S.: I am using an NVIDIA TITAN RTX.
@njean78 I am running Ubuntu 16.04, so it looks like the error is OS-independent.
I solved my issue by installing the PyTorch build for CUDA 10 (I got it from https://pytorch.org/). I was probably using the build for CUDA 9...
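The mismatch described above (a PyTorch binary built for one CUDA version running on a system set up for another) can be checked from Python. The following is a minimal sketch, assuming only that `torch` is importable; the helper name `report_versions` is hypothetical and not part of PyTorch or Apex.

```python
def report_versions():
    """Collect the versions PyTorch was built with and what it sees at runtime."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to inspect.
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        # CUDA version the PyTorch binary was compiled against
        # (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        # cuDNN version as an integer such as 7401 (may be None if unavailable).
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # Name of the first visible GPU, e.g. to confirm which card is in use.
        info["device_0"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` reports a 9.x version on a machine set up for CUDA 10, reinstalling the matching PyTorch build is probably the first thing to try.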
> I tried updating and unfortunately the error persists. The command you mentioned outputs 7401.
@pancho111203 Since you've got cuda 10 on bare metal (meaning your system has the cuda 10 driver) you should be using Pytorch for cuda 10. When you say "I tried updating" do you mean you only updated cudnn, or did you try running in one of the cuda 10 containers I mentioned?
If you are running PyTorch in Docker, you should know about this: https://github.com/NVIDIA/tacotron2/issues/109
- Ubuntu 16.04
- cuda 10
- pytorch 1.3 (preview)
- cudnn version : 7602
Still having this problem
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Can anyone give me some help? Thanks a lot!
Actually, this happens on GPU card 3, and it is fine on the other GPU cards.
I only use one GPU at a time.
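When the failure seems tied to one card, a small convolution run on each visible GPU can help narrow it down. This is a rough sketch under stated assumptions: the function name `check_conv_on_each_gpu` and the tensor sizes are made up for illustration, and it does not use Apex, so it only tests plain PyTorch + cuDNN on each device.

```python
def check_conv_on_each_gpu(channels=8, size=32):
    """Run one small Conv2d forward pass per visible GPU and record the outcome."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there are no devices to test.
        return {}

    results = {}
    # device_count() is 0 when no CUDA device is visible, so the loop is skipped.
    for index in range(torch.cuda.device_count()):
        device = torch.device("cuda", index)
        try:
            conv = torch.nn.Conv2d(3, channels, kernel_size=3, padding=1).to(device)
            x = torch.randn(4, 3, size, size, device=device)
            conv(x)
            # CUDA kernels run asynchronously; synchronize so that any error
            # surfaces inside this try block.
            torch.cuda.synchronize(device)
            results[index] = "ok"
        except RuntimeError as err:
            results[index] = f"failed: {err}"
    return results


if __name__ == "__main__":
    for index, status in check_conv_on_each_gpu().items():
        print(f"cuda:{index} -> {status}")
```

A CUDA error can leave the process in a bad state, so a failure reported for a later device may be a side effect of an earlier one. Running the script once per card with `CUDA_VISIBLE_DEVICES` set to a single index is a more reliable way to confirm which card is affected.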
@zhixuanli Which GPUs are you using and do you have a reproducible code snippet?
Was apex installed successfully?
Check the file path. It worked for me.