agent57_pytorch cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I am trying to run your code on a fresh install of Ubuntu 20.04 with Python 3.9.5 and CUDA 11.6 / cuDNN 8.3.2, but executing main.py fails with the following cuDNN error:

$ python main.py 
2022-01-21 16:02:17,793	INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=36888) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36888) [Powered by Stella]
(pid=36874) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36874) [Powered by Stella]
(pid=36881) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36881) [Powered by Stella]
(pid=36885) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36885) [Powered by Stella]
(pid=36882) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36882) [Powered by Stella]
(pid=36875) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=36875) [Powered by Stella]
====================================================================================================
Traceback (most recent call last):
  File "/home/nate/Desktop/Atom/agent57_pytorch/main.py", line 267, in <module>
    main(parser.parse_args())
  File "/home/nate/Desktop/Atom/agent57_pytorch/main.py", line 144, in main
    in_q_weight, ex_q_weight, embed_weight, trained_lifelong_weight, indices, priorities, in_q_loss, ex_q_loss, embed_loss, lifelong_loss = ray.get(finished_learner[0])
  File "/home/nate/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/nate/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1495, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::Learner.update_network() (pid=36888, ip=192.168.137.71)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
  File "/home/nate/miniconda3/lib/python3.9/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/nate/Desktop/Atom/agent57_pytorch/learner.py", line 262, in update_network
    priorities, in_q_loss, ex_q_loss = self.qnet_update(weights, segments)
  File "/home/nate/Desktop/Atom/agent57_pytorch/learner.py", line 308, in qnet_update
    ex_target_qvalues = self.get_qvalues(self.ex_target_q_network, ex_h0, ex_c0)
  File "/home/nate/Desktop/Atom/agent57_pytorch/learner.py", line 371, in get_qvalues
    _, (h, c) = q_network(self.states[t],
  File "/home/nate/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nate/Desktop/Atom/agent57_pytorch/model.py", line 99, in forward
    x, states = self.lstm(x.unsqueeze(0), states)
  File "/home/nate/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nate/miniconda3/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 679, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Have you encountered an error like this during development? Are you using an older version of CUDA / cuDNN? Please let me know if you have any suggestions.

Jan 21 '22 22:01 nhansendev

Based on other, similar issues, I think the problem is that not all tensors are being moved to the GPU when CUDA is available. I'm trying to find where ".to(self.device)" might be missing. Could someone confirm whether the code runs on CUDA without changes, or was it only ever run on CPU?
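One way to look for a missing ".to(self.device)" is to list which device each tensor is on just before the failing LSTM call. The following is a minimal sketch, not code from the repository: the helper names summarize_devices and find_off_device are hypothetical, and they only assume that each object exposes a ".device" attribute the way PyTorch tensors and parameters do, so the snippet itself does not import torch.

```python
from collections import Counter


def _device_type(obj):
    # "cuda:0" -> "cuda", "cpu" -> "cpu"; works for torch.device or plain strings.
    return str(obj.device).split(":")[0]


def summarize_devices(named_tensors):
    """Count tensors per device type.

    `named_tensors` is an iterable of (name, obj) pairs where each obj has
    a `.device` attribute (hypothetical helper, not part of agent57_pytorch).
    """
    return Counter(_device_type(t) for _, t in named_tensors)


def find_off_device(named_tensors, expected_type="cuda"):
    """Return the names of all tensors that are not on `expected_type`."""
    return [name for name, t in named_tensors if _device_type(t) != expected_type]
```

Assuming the names from the traceback above, this could be called in get_qvalues as find_off_device(q_network.named_parameters()) for the weights, and as find_off_device([("state", self.states[t]), ("h0", ex_h0), ("c0", ex_c0)]) for the inputs. A non-empty result would point at a tensor that was left on the CPU. Note that named_parameters() returns a generator, so it has to be called again for each helper.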

Running on CPU works fine, though it is probably slower than on GPU.

Jan 27 '22 15:01 nhansendev

@Obliman Sorry for the late reply, and thanks for your question. I was able to run my code with CUDA 10.0. I hope this helps!

Feb 21 '22 11:02 yuta0821
