EasyEspnet

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Open Oision-hub opened this issue 3 years ago • 4 comments

Hi, when I try the demo in Docker, I get the following error.

root@Oision-Legion-R7000P2021H:~/EasyEspnet# python train.py --root_path data/an4/asr1/ --dataset an4
2022-03-28 03:29:05,274 (utils:21) WARNING: Skip DEBUG/INFO messages
2022-03-28 03:29:05,349 (train:179) WARNING: ngpu: 1
2022-03-28 03:29:06,526 (data_load:94) WARNING: #Train Json data/an4/asr1/dump/train_nodev/deltafalse/data.json: 848
2022-03-28 03:29:06,526 (data_load:95) WARNING: #Dev Json data/an4/asr1/dump/train_dev/deltafalse/data.json: 100
2022-03-28 03:29:06,526 (data_load:96) WARNING: #Test Json data/an4/asr1/dump/test/deltafalse/data.json: 130
2022-03-28 03:38:48,454 (train:301) WARNING: Total parameter of the model = 27181116
2022-03-28 03:38:48,455 (train:305) WARNING: Trainable parameter of the model = 27181116
Traceback (most recent call last):
  File "train.py", line 315, in <module>
    train(dataloaders, model, optimizer, save_path)
  File "train.py", line 107, in train
    train_stats = train_epoch(train_loader, model, optimizer)
  File "train.py", line 55, in train_epoch
    loss = model(fbank, seq_lens, tokens).mean() # / self.accum_grad
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/espnet/espnet/nets/pytorch_backend/e2e_asr_transformer.py", line 178, in forward
    hs_pad, hs_mask = self.encoder(xs_pad, src_mask)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/espnet/espnet/nets/pytorch_backend/transformer/encoder.py", line 298, in forward
    xs, masks = self.embed(xs, masks)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/espnet/espnet/nets/pytorch_backend/transformer/subsampling.py", line 75, in forward
    x = self.conv(x)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Oision-hub avatar Mar 28 '22 03:03 Oision-hub

This looks like an environment issue. Are you using my Docker image? I ask because the prompt root@Oision-Legion-R7000P2021H doesn't look like it is inside the container. If you are, which GPU and CUDA version does your machine have?

jindongwang avatar Mar 28 '22 10:03 jindongwang

Yes, it's an environment issue. I'm using your Docker image. I checked the PyTorch version and found that my CUDA version (11.4) is not compatible with the PyTorch version in the image. So I updated PyTorch, and that seems to work: training now starts, but then it fails with RuntimeError: Unable to find a valid cuDNN algorithm to run convolution. I searched for this error online, and it can happen when GPU memory is full. I tried reducing the batch size in data_load.py, but the error persists.

  • GPU: NVIDIA GeForce RTX 3060 Laptop
  • GPU Total Memory: 6144 MB

Oision-hub avatar Mar 29 '22 02:03 Oision-hub
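
The version mismatch described above can be checked from inside the container. The sketch below is only an illustration, not part of EasyEspnet: the helper names `build_supports_gpu` and `report` are hypothetical, and `torch.cuda.get_arch_list` is assumed to be missing on older PyTorch builds, so it is looked up defensively. An RTX 3060 Laptop GPU reports compute capability 8.6, so a build whose architecture list stops at `sm_75` has no kernels for it.

```python
import re


def build_supports_gpu(capability, arch_list):
    """Return True if any compiled 'sm_XY' arch can run on the given GPU.

    A binary built for sm_XY runs on a device with the same major version
    and an equal or higher minor version. PTX ('compute_XY') entries are
    ignored here, so this is a conservative check.
    """
    major, minor = capability
    for arch in arch_list:
        match = re.match(r"^sm_(\d+)", arch)
        if not match:
            continue
        number = int(match.group(1))
        if number // 10 == major and number % 10 <= minor:
            return True
    return False


def report():
    """Print the PyTorch/CUDA/cuDNN versions and whether the GPU is covered."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("PyTorch:", torch.__version__)
    print("Built against CUDA:", torch.version.cuda)
    print("cuDNN:", torch.backends.cudnn.version())
    if not torch.cuda.is_available():
        print("CUDA is not available to PyTorch")
        return
    capability = torch.cuda.get_device_capability(0)
    print("GPU:", torch.cuda.get_device_name(0), "capability", capability)
    # Assumption: get_arch_list may not exist on old builds, so guard the lookup.
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    if get_arch_list is not None:
        archs = get_arch_list()
        print("Compiled for:", archs)
        print("GPU covered by this build:", build_supports_gpu(capability, archs))


if __name__ == "__main__":
    report()
```

If the last line prints `False`, upgrading to a PyTorch build compiled for a newer CUDA toolkit, as was done in this thread, is the usual way forward. A matching build only rules out the architecture mismatch; it does not address out-of-memory failures.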

It seems that your GPU is not suitable for training speech tasks. Honestly, speech tasks are very demanding on hardware, and we run our experiments on Microsoft Azure with a large number of GPUs. As I recall, we used 8 V100 GPUs to train this model, so I suspect your machine cannot handle it. However, our Docker environment can still help you set up the ESPnet environment quickly, so you can run your own experiments.

jindongwang avatar Mar 29 '22 02:03 jindongwang

Oh, OK. Thank you!

Oision-hub avatar Mar 29 '22 02:03 Oision-hub