pytorch-deeplab-xception
CuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Hello, I want to train on my own dataset. However, when I run the code, the following error occurs:
Namespace(backbone='resnet', base_size=513, batch_size=8, checkname='deeplab-resnet', crop_size=513, cuda=True, dataset='pascal', epochs=50, eval_interval=1, freeze_bn=False, ft=False, gpu_ids=[0], loss_type='ce', lr=0.007, lr_scheduler='poly', momentum=0.9, nesterov=False, no_cuda=False, no_val=False, out_stride=16, resume=None, seed=1, start_epoch=0, sync_bn=False, test_batch_size=8, use_balanced_weights=False, use_sbd=False, weight_decay=0.0005, workers=4)
Number of images in train: 3184
Number of images in val: 797
Using poly LR Scheduler!
Starting Epoch: 0
Total Epoches: 50
0%| | 0/398 [00:00<?, ?it/s]
=>Epoches 0, learning rate = 0.0070, previous best = 0.0000
/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
  warnings.warn(warning.format(ret))
Train loss: 0.288: 1%|▏ | 3/398 [00:03<07:59, 1.21s/it]
Traceback (most recent call last):
  File "train.py", line 305, in <module>
    main()
  File "train.py", line 298, in main
    trainer.training(epoch)
  File "train.py", line 109, in training
    loss.backward()
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED
/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [13,0,0], thread: [457,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [13,0,0], thread: [458,0,0] Assertion `t >= 0 && t < n_classes` failed.
It seems like the error is in your labels. Maybe you should check your labels, or provide more details about how this error comes up.
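For reference, the CUDA assertion `t >= 0 && t < n_classes` fires when a ground-truth pixel holds a class index outside `[0, n_classes)`. A minimal sketch for scanning masks, assuming VOC-style annotations with 21 classes and 255 as the ignore index (adjust both for your dataset):

```python
import numpy as np

NUM_CLASSES = 21    # assumption: PASCAL VOC (20 classes + background)
IGNORE_INDEX = 255  # assumption: VOC-style "void" label that the loss ignores

def invalid_labels(mask, num_classes=NUM_CLASSES, ignore_index=IGNORE_INDEX):
    """Return the unique label values that would trip the CUDA-side
    `t >= 0 && t < n_classes` assertion in the NLL loss kernel."""
    mask = np.asarray(mask)
    bad = mask[((mask < 0) | (mask >= num_classes)) & (mask != ignore_index)]
    return np.unique(bad)

# Example usage over ground-truth PNGs (the path pattern is an assumption):
# from PIL import Image
# import glob
# for path in glob.glob("SegmentationClass/*.png"):
#     bad = invalid_labels(np.array(Image.open(path)))
#     if bad.size:
#         print(path, "contains invalid labels:", bad)
```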
Thanks! I have changed the number of labels, but the error is the same.
Train loss: 0.193: 2%|▍ | 7/398 [00:06<06:18, 1.03it/s]
Traceback (most recent call last):
  File "train.py", line 305, in <module>
    main()
  File "train.py", line 298, in main
    trainer.training(epoch)
  File "train.py", line 109, in training
    loss.backward()
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/image/anaconda3/envs/ajy/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I did not encounter such a problem before. Can you successfully run my default training code on the VOC dataset?
I have the same problem when I run the default training code on the VOC dataset. Have you solved it?
Have the same issue.
Using poly LR Scheduler!
Starting Epoch: 0
Total Epoches: 50
0%| | 0/4179 [00:00<?, ?it/s]
=>Epoches 0, learning rate = 0.0070, previous best = 0.0000
/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
  warnings.warn(warning.format(ret))
Traceback (most recent call last):
  File "train.py", line 301, in <module>
    main()
  File "train.py", line 294, in main
    trainer.training(epoch)
  File "train.py", line 106, in training
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Any suggestions?
Maybe try a smaller batch size if your GPU does not have enough memory.
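If a smaller batch size does not help: this particular CuDNN failure is often an out-of-range target index surfacing asynchronously from the GPU. Running the same loss on CPU usually produces a readable error message instead. A minimal repro sketch (the 21-class shape is an assumption matching VOC; the exact exception type varies by PyTorch version):

```python
import torch
import torch.nn.functional as F

# Out-of-range target: class 30 with only 21 classes. On CPU this fails
# with a readable message instead of CUDNN_STATUS_EXECUTION_FAILED.
logits = torch.randn(1, 21, 4, 4)                     # N, C, H, W
target = torch.full((1, 4, 4), 30, dtype=torch.long)  # invalid label 30

try:
    F.cross_entropy(logits, target)
except (IndexError, RuntimeError) as exc:
    print("loss rejected the target:", type(exc).__name__)
```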
I recently ran into the same issue. Any suggestions?
Did you solve this problem? Could you give some suggestions? Thank you.
In my case, I solved the same issue by fixing the erroneous labels in my own dataset.
Can you be more specific about where you made those changes?
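For anyone else landing here: "fixing the labels" generally means remapping raw annotation values into the contiguous range `[0, n_classes)` that the loss expects, with everything else sent to the ignore index. A hedged sketch (the `mapping` dict and the ignore value of 255 are assumptions about your annotation format):

```python
import numpy as np

IGNORE_INDEX = 255  # assumption: the loss is configured to ignore 255

def remap_labels(mask, mapping, ignore_index=IGNORE_INDEX):
    """Map raw annotation values to contiguous train ids 0..n_classes-1.
    Any value not present in `mapping` becomes `ignore_index`."""
    mask = np.asarray(mask)
    out = np.full_like(mask, ignore_index)
    for raw_value, train_id in mapping.items():
        out[mask == raw_value] = train_id
    return out

# Example: raw values 0 and 128 become classes 0 and 1; everything
# else (here the stray value 7) is sent to the ignore index.
# remap_labels(np.array([[0, 128, 7]]), {0: 0, 128: 1})
```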