RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Open Jason-kid opened this issue 6 years ago • 0 comments

Hi! When I ran the ASR model training stage (stage 4) on eight GTX 1080 Ti GPUs, I got the following error:

Original utterance num: 281241
Removed 54 utterances (threshold)
Original utterance num: 2703
Removed 61 utterances (threshold)
5%|▌         | 15240/281187 [05:13<1:39:01, 44.76it/s]
Traceback (most recent call last):
  File "/asr/neural_sp-master/examples/librispeech/s5/../../../neural_sp/bin/asr/train.py", line 533, in <module>
    save_path = pr.runcall(main)
  File "/asr/miniconda/lib/python3.7/cProfile.py", line 121, in runcall
    return func(*args, **kw)
  File "/asr/neural_sp-master/examples/librispeech/s5/../../../neural_sp/bin/asr/train.py", line 367, in main
    teacher=teacher, teacher_lm=teacher_lm)
  File "/asr/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/asr/miniconda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/asr/miniconda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/asr/miniconda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/asr/miniconda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/asr/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/asr/neural_sp-master/neural_sp/models/seq2seq/speech2text.py", line 426, in forward
    loss, reporter = self._forward(batch, task, reporter, teacher, teacher_lm)
  File "/asr/neural_sp-master/neural_sp/models/seq2seq/speech2text.py", line 461, in _forward
    enc_outs = self.encode(batch['xs'], 'all', flip=flip)
  File "/asr/neural_sp-master/neural_sp/models/seq2seq/speech2text.py", line 568, in encode
    enc_outs = self.enc(xs, xlens, task.split('.')[0])
  File "/asr/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/asr/neural_sp-master/neural_sp/models/seq2seq/encoders/rnn.py", line 300, in forward
    xs = self.padding(xs, xlens, self.rnn[l])
  File "/asr/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/asr/neural_sp-master/neural_sp/models/seq2seq/encoders/rnn.py", line 378, in forward
    xs, _ = rnn(xs, hx=None)
  File "/asr/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/asr/miniconda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 182, in forward
    self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Does anyone know what the problem is?

My environment:
system: Ubuntu 16.04
GPU: NVIDIA GTX 1080 Ti
Python: 3.7.4
CUDA: 9.0.176
torch: 1.0.0
cuDNN: 7.0.5
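
For reference, the versions listed above can also be read from inside the Python process that runs training, which confirms what PyTorch itself was built against and can see. This is only a minimal sketch: `collect_env` is a hypothetical helper name, not part of neural_sp, and it assumes nothing beyond the standard `torch` attributes shown.

```python
import platform


def collect_env():
    """Gather version info relevant to a cuDNN failure (hypothetical helper)."""
    info = {
        "python": platform.python_version(),
        "system": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built with (None for CPU-only builds).
    info["cuda"] = torch.version.cuda
    # cuDNN version as an integer, or None if cuDNN is unavailable.
    info["cudnn"] = torch.backends.cudnn.version()
    # Number and names of the GPUs visible to this process.
    info["gpu_count"] = torch.cuda.device_count()
    info["gpus"] = [
        torch.cuda.get_device_name(i) for i in range(info["gpu_count"])
    ]
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Running it in the same conda environment as `train.py` shows whether the CUDA and cuDNN versions PyTorch reports match the ones installed on the system.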

However, it worked when I used only one GPU. Can anyone help me resolve this issue? Thank you!
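
Since the failure only appears with multiple GPUs, one way to narrow it down is to run a small cuDNN RNN on each device separately and see whether any single card fails. The following is a hedged sketch rather than a diagnosis: `check_devices` is a hypothetical helper, the tensor sizes are arbitrary, and a passing result does not rule out a problem that only occurs under `DataParallel`.

```python
def check_devices(hidden=8):
    """Run a tiny bidirectional LSTM on every visible GPU (hypothetical helper).

    Returns a dict mapping device index to "ok" or to the error message.
    """
    try:
        import torch
    except ImportError:
        # Nothing to check without PyTorch.
        return {}

    results = {}
    for i in range(torch.cuda.device_count()):
        device = torch.device("cuda", i)
        try:
            rnn = torch.nn.LSTM(hidden, hidden, bidirectional=True).to(device)
            # Shape is (seq_len, batch, features); the sizes are arbitrary.
            xs = torch.randn(5, 2, hidden, device=device)
            out, _ = rnn(xs)
            # .item() copies to the host, which forces the kernel to finish
            # so that an asynchronous CUDA error surfaces here.
            out.sum().item()
            results[i] = "ok"
        except RuntimeError as err:
            results[i] = str(err)
    return results


if __name__ == "__main__":
    for index, status in check_devices().items():
        print(f"cuda:{index}: {status}")
```

If one index reports an error while the others report "ok", excluding that card via `CUDA_VISIBLE_DEVICES` is a quick way to test whether the fault is tied to that device.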

Sep 18 '19 01:09 Jason-kid