neural_chat
CUDNN_STATUS_EXECUTION_FAILED while training VHRED
The command I use is: python model/train.py --data=reddit_casual --model=VHRED --batch_size=2
It then fails with CUDNN_STATUS_EXECUTION_FAILED:
Training Start!
0%| | 0/43573 [00:00<?, ?it/s]
Traceback (most recent call last):
File "model/train.py", line 91, in <module>
solver.train()
File "/projects/da33/tao/test-project/neural_chat/model/utils/time_track.py", line 18, in timed
result = method(*args, **kwargs)
File "/projects/da33/tao/test-project/neural_chat/model/solver.py", line 905, in train
mode='train', kl_mult=kl_mult)
File "/projects/da33/tao/test-project/neural_chat/model/solver.py", line 1079, in _process_batch
extra_context_inputs=extra_context_inputs)
File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/projects/da33/tao/test-project/neural_chat/model/models.py", line 361, in forward
sentence_length)
File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/projects/da33/tao/test-project/neural_chat/model/layers/encoder.py", line 127, in forward
outputs, hidden = self.rnn(rnn_input, hidden)
File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/modules/rnn.py", line 192, in forward
output, hidden = func(input, self.all_weights, hx, batch_sizes)
File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/_functions/rnn.py", line 323, in forward
return func(input, *fargs, **fkwargs)
File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/_functions/rnn.py", line 275, in forward
train, dropout_seed, dropout_state)
File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 44, in init_dropout_state
if dropout_p != 0 else None
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED
My environment: GPU: NVIDIA Tesla V100, Python 3.6.8, CUDA 10.1, torch 0.4.0
I have spent four days on this problem. Do you have any idea how to solve it?
Hi, I'm facing the same problem. Do you have any idea how to solve it?
Thanks
Hi, I fixed the problem.
Setup: RTX 2080, CUDA 9.0, PyTorch 0.4.1 built for CUDA 9.0 (using a Docker image from the official PyTorch Docker Hub)
Solution: set torch.backends.cudnn.enabled = False in the code, then run it again
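For anyone landing here later, a minimal sketch of that workaround. The helper name disable_cudnn and the guard for a missing torch install are my own additions for illustration, not part of neural_chat. In the real project the flag would presumably be set near the top of model/train.py, before the model is built, but I have not checked where the authors would prefer it.

```python
import importlib.util


def disable_cudnn():
    """Switch off the cuDNN backend in PyTorch.

    Returns True if the flag was set, False if torch is not installed.
    (Hypothetical helper, not part of neural_chat.)
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    # With cuDNN disabled, RNN layers use PyTorch's own kernels instead.
    torch.backends.cudnn.enabled = False
    return True


# Call this before constructing the model or running the first batch.
cudnn_disabled = disable_cudnn()
print("cuDNN disabled:", cudnn_disabled)
```

Note that this avoids the failing cuDNN call rather than fixing it, and training is usually slower without cuDNN. If speed matters, it may be worth checking whether the installed PyTorch build matches the GPU and CUDA version.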