agent57_pytorch
cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I am trying to run your code on a fresh install of Ubuntu 20.04 with Python 3.9.5 and CUDA 11.6 / cuDNN 8.3.2, but executing main.py produces the following cuDNN error:
$ python main.py
2022-01-21 16:02:17,793 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=36888) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36888) [Powered by Stella]
(pid=36874) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36874) [Powered by Stella]
(pid=36881) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36881) [Powered by Stella]
(pid=36885) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36885) [Powered by Stella]
(pid=36882) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36882) [Powered by Stella]
(pid=36875) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36875) [Powered by Stella]
====================================================================================================
Traceback (most recent call last):
File "/home/nate/Desktop/Atom/agent57_pytorch/main.py", line 267, in <module>
main(parser.parse_args())
File "/home/nate/Desktop/Atom/agent57_pytorch/main.py", line 144, in main
in_q_weight, ex_q_weight, embed_weight, trained_lifelong_weight, indices, priorities, in_q_loss, ex_q_loss, embed_loss, lifelong_loss = ray.get(finished_learner[0])
File "/home/nate/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
return func(*args, **kwargs)
File "/home/nate/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1495, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::Learner.update_network() (pid=36888, ip=192.168.137.71)
File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
File "/home/nate/miniconda3/lib/python3.9/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/home/nate/Desktop/Atom/agent57_pytorch/learner.py", line 262, in update_network
priorities, in_q_loss, ex_q_loss = self.qnet_update(weights, segments)
File "/home/nate/Desktop/Atom/agent57_pytorch/learner.py", line 308, in qnet_update
ex_target_qvalues = self.get_qvalues(self.ex_target_q_network, ex_h0, ex_c0)
File "/home/nate/Desktop/Atom/agent57_pytorch/learner.py", line 371, in get_qvalues
_, (h, c) = q_network(self.states[t],
File "/home/nate/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/nate/Desktop/Atom/agent57_pytorch/model.py", line 99, in forward
x, states = self.lstm(x.unsqueeze(0), states)
File "/home/nate/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/nate/miniconda3/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 679, in forward
result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Have you encountered an error like this during development? Are you using an older version of CUDA / cuDNN? Please let me know if you have any suggestions.
Based on other, similar issues, I think the problem is that not all tensors are being moved to the GPU when CUDA is available. I'm trying to find where ".to(self.device)" might be missing. Could someone confirm whether the code runs on CUDA without changes, or was it only ever run on CPU?
Running on CPU works fine, though it is probably slower than on GPU.
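One way to test the missing ".to(self.device)" hypothesis is to list every parameter or input tensor that is not on the expected device. The helper below is a minimal sketch, not part of agent57_pytorch: the name `find_device_mismatches` is hypothetical, and it relies only on each object exposing a `.device` attribute, as PyTorch tensors do. Whether a device mismatch is actually the cause of this error is not confirmed in the thread.

```python
def find_device_mismatches(named_tensors, expected_type="cuda"):
    """Return (name, device) pairs whose device type differs from expected_type.

    `named_tensors` is any mapping of name -> object with a `.device`
    attribute, e.g. dict(model.named_parameters()) for a torch.nn.Module.
    """
    mismatches = []
    for name, tensor in named_tensors.items():
        device = getattr(tensor, "device", None)
        if device is None:
            continue  # not tensor-like; skip it
        # torch.device exposes `.type` ("cpu" or "cuda"); otherwise fall back
        # to parsing a string such as "cuda:0".
        device_type = getattr(device, "type", str(device).split(":")[0])
        if device_type != expected_type:
            mismatches.append((name, str(device)))
    return mismatches
```

With PyTorch, this could be called just before the failing LSTM forward pass, for example as `find_device_mismatches(dict(q_network.named_parameters()))` and `find_device_mismatches({"h0": ex_h0, "c0": ex_c0})`, using the variable names from the traceback. An empty list means everything checked is on the expected device type.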
@Obliman Sorry for the late reply, and thanks for your question. I was able to run my code with CUDA 10.0. I hope this helps!
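To compare a failing setup against the CUDA 10.0 setup mentioned above, it can help to print the versions PyTorch itself reports. The CUDA version a PyTorch build was compiled against (`torch.version.cuda`) can differ from the system toolkit version. The function name `describe_environment` below is a hypothetical helper, offered as a sketch rather than anything from the repository.

```python
import platform


def describe_environment():
    """Collect the Python, PyTorch, CUDA, and cuDNN versions visible to this process."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda          # CUDA version PyTorch was built with (None for CPU builds)
    info["cudnn"] = torch.backends.cudnn.version()   # cuDNN version as an integer, or None if unavailable
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Posting this output alongside a traceback makes it easier for others to tell whether the PyTorch build, the driver, and the GPU are a supported combination.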