EasyEspnet
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Hi, when I tried the demo in Docker, I ran into the following problem.
root@Oision-Legion-R7000P2021H:~/EasyEspnet# python train.py --root_path data/an4/asr1/ --dataset an4
2022-03-28 03:29:05,274 (utils:21) WARNING: Skip DEBUG/INFO messages
2022-03-28 03:29:05,349 (train:179) WARNING: ngpu: 1
2022-03-28 03:29:06,526 (data_load:94) WARNING: #Train Json data/an4/asr1/dump/train_nodev/deltafalse/data.json: 848
2022-03-28 03:29:06,526 (data_load:95) WARNING: #Dev Json data/an4/asr1/dump/train_dev/deltafalse/data.json: 100
2022-03-28 03:29:06,526 (data_load:96) WARNING: #Test Json data/an4/asr1/dump/test/deltafalse/data.json: 130
2022-03-28 03:38:48,454 (train:301) WARNING: Total parameter of the model = 27181116
2022-03-28 03:38:48,455 (train:305) WARNING: Trainable parameter of the model = 27181116
Traceback (most recent call last):
File "train.py", line 315, in <module>
train(dataloaders, model, optimizer, save_path)
File "train.py", line 107, in train
train_stats = train_epoch(train_loader, model, optimizer)
File "train.py", line 55, in train_epoch
loss = model(fbank, seq_lens, tokens).mean() # / self.accum_grad
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/opt/espnet/espnet/nets/pytorch_backend/e2e_asr_transformer.py", line 178, in forward
hs_pad, hs_mask = self.encoder(xs_pad, src_mask)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/opt/espnet/espnet/nets/pytorch_backend/transformer/encoder.py", line 298, in forward
xs, masks = self.embed(xs, masks)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/opt/espnet/espnet/nets/pytorch_backend/transformer/subsampling.py", line 75, in forward
x = self.conv(x)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
This looks like an environment issue. Are you using my Docker image? The prompt root@Oision-Legion-R7000P2021H doesn't look like it is inside the container. If you are, what GPU and CUDA version is your machine running?
Yes, it's an environment issue. I am using your Docker image. I checked the PyTorch version and found that my CUDA version (11.4) is not compatible with the PyTorch version shipped in the Docker image. So I updated PyTorch and it seems to work now.
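For anyone hitting the same thing, a minimal sanity check like the sketch below (not part of EasyEspnet, just plain PyTorch calls) can confirm whether the installed PyTorch build, its CUDA/cuDNN versions, and the GPU actually work together; a broken build/driver combination will fail on the tiny convolution with the same cuDNN error as in the traceback.

```python
# Sanity-check sketch: verify the PyTorch/CUDA/cuDNN combination inside the container.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA version PyTorch was built with:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # A tiny conv on the GPU fails fast with a cuDNN error
    # if the build/driver combination is broken.
    x = torch.randn(1, 1, 32, 32, device="cuda")
    conv = torch.nn.Conv2d(1, 8, kernel_size=3).cuda()
    print("Test conv output shape:", conv(x).shape)
```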
Training now starts, but it fails with RuntimeError: Unable to find a valid cuDNN algorithm to run convolution. Searching for this error online, it seems to happen when GPU memory usage is full. I tried reducing the batch size in data_load.py, but the error still occurs (see the sketch below the GPU specs for how I checked memory).
- GPU: NVIDIA GeForce RTX 3060 Laptop
- GPU Total Memory: 6144 MB
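As a rough sketch of how to check whether memory pressure is the cause: the PyTorch calls below report total and reserved GPU memory, and the batch_size value is purely hypothetical, it is not EasyEspnet's actual config key, so the real change still has to be made wherever data_load.py builds the DataLoader.

```python
# Sketch: inspect GPU memory and apply common mitigations for
# "Unable to find a valid cuDNN algorithm" caused by low memory.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_mb = props.total_memory / 1024 ** 2
    reserved_mb = torch.cuda.memory_reserved(0) / 1024 ** 2
    print(f"Total GPU memory: {total_mb:.0f} MB, reserved by PyTorch: {reserved_mb:.0f} MB")

# Disabling cuDNN benchmarking and lowering the batch size are the usual
# mitigations when cuDNN cannot find a workable algorithm under memory pressure.
torch.backends.cudnn.benchmark = False
batch_size = 8  # hypothetical value; lower it until the error disappears
```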
It seems that your GPU is not suitable for training speech tasks. Honestly speaking, speech tasks really consume hardware resources, and we run our experiments on Microsoft Azure with a large number of GPUs; I remember we used 8 V100 GPUs to train this model. So I suspect your machine cannot handle it. However, our Docker environment can still help you set up the ESPnet environment quickly, so you can run your own experiments.
Oh, OK. Thank you!