
CUDNN_STATUS_EXECUTION_FAILED while training VHRED

Open WilliamsToTo opened this issue 4 years ago • 2 comments

The command I use is: python model/train.py --data=reddit_casual --model=VHRED --batch_size=2

It then reports CUDNN_STATUS_EXECUTION_FAILED:

Training Start!
  0%|                                                 | 0/43573 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "model/train.py", line 91, in <module>
    solver.train()
  File "/projects/da33/tao/test-project/neural_chat/model/utils/time_track.py", line 18, in timed
    result = method(*args, **kwargs)
  File "/projects/da33/tao/test-project/neural_chat/model/solver.py", line 905, in train
    mode='train', kl_mult=kl_mult)
  File "/projects/da33/tao/test-project/neural_chat/model/solver.py", line 1079, in _process_batch
    extra_context_inputs=extra_context_inputs)
  File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/projects/da33/tao/test-project/neural_chat/model/models.py", line 361, in forward
    sentence_length)
  File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/projects/da33/tao/test-project/neural_chat/model/layers/encoder.py", line 127, in forward
    outputs, hidden = self.rnn(rnn_input, hidden)
  File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/modules/rnn.py", line 192, in forward
    output, hidden = func(input, self.all_weights, hx, batch_sizes)
  File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/_functions/rnn.py", line 323, in forward
    return func(input, *fargs, **fkwargs)
  File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/nn/_functions/rnn.py", line 275, in forward
    train, dropout_seed, dropout_state)
  File "/projects/da33/tao/test-project/neural_chat/env/lib64/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 44, in init_dropout_state
    if dropout_p != 0 else None
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED

My environment: GPU: NVIDIA Tesla V100, Python 3.6.8, CUDA 10.1, torch 0.4.0
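
For anyone comparing setups: the versions that matter here are the ones the installed PyTorch build reports, which can differ from the system CUDA toolkit. Below is a minimal sketch for printing them, assuming torch is importable in the same environment used for training. The helper name report_env is hypothetical and not part of neural_chat.

```python
def report_env():
    """Collect the versions PyTorch itself reports (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this interpreter; nothing more to report.
        return {"torch": None}

    try:
        cudnn_version = torch.backends.cudnn.version()
    except RuntimeError:
        # Some builds raise instead of returning None when cuDNN cannot load.
        cudnn_version = None

    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # CUDA version torch was built with
        "cudnn": cudnn_version,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in report_env().items():
        print(f"{key}: {value}")
```

If cuda_build does not match the CUDA version listed above, that mismatch may be worth mentioning when reporting the error.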

I have spent four days on this problem. Do you have any idea how to solve it?

WilliamsToTo avatar Oct 31 '20 03:10 WilliamsToTo

Hi, I'm facing the same problem. Do you have any idea how to solve it?

Thanks

CaesarWWK avatar Dec 29 '20 16:12 CaesarWWK

Hi, I fixed the problem.

Setting: RTX 2080, CUDA 9.0, PyTorch 0.4.1 built with CUDA 9.0 (using a Docker image from the official PyTorch Docker Hub repository)

Solution: set torch.backends.cudnn.enabled = False in the code, then run the training again.
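
A minimal sketch of that workaround, assuming it runs before the model is built (for example near the top of model/train.py; the exact placement is an assumption, and disable_cudnn is a hypothetical helper name):

```python
def disable_cudnn():
    """Turn off the cuDNN backend so PyTorch uses its native CUDA kernels.

    Returns the resulting flag value, or None if torch is not installed.
    """
    try:
        import torch
    except ImportError:
        return None
    # Global switch: affects every layer created in this process,
    # including the RNN call that fails in the traceback above.
    torch.backends.cudnn.enabled = False
    return torch.backends.cudnn.enabled


if __name__ == "__main__":
    print("cuDNN enabled:", disable_cudnn())
```

With cuDNN disabled, the RNN layers fall back to PyTorch's own implementation, so training will likely be slower. This avoids the failing cuDNN call rather than fixing whatever causes it.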

CaesarWWK avatar Dec 30 '20 04:12 CaesarWWK