
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Open yjcreation opened this issue 2 years ago • 3 comments

When I train with python tools/train.py -f exps/example/yolox_voc/yolox_voc_s.py -d 0 -b 8 --fp16 -o -c yolox_s.pth, I hit the following error:

2022-06-06 16:24:43 | ERROR | yolox.core.launch:98 - An error has been caught in function 'launch', process 'MainProcess' (7800), thread 'MainThread' (7924):

Traceback (most recent call last):

  File "tools\train.py", line 140, in <module>
    args=(exp, args),
      └ Namespace(batch_size=4, cache=False, ckpt='yolox_s.pth', devices=0, dist_backend='nccl', dist_url=None, exp_file='exps/exampl...
      └ ╒═══════════════════╤═══════════...

  File "d:\e\spore\yolox\yolox\core\launch.py", line 98, in launch
    main_func(*args)
      └ (╒═══════════════════╤═══════════...
      └ <function main at 0x0000027189606378>

  File "tools\train.py", line 117, in main
    trainer.train()
      └ <function Trainer.train at 0x0000027189A686A8>
      └ <yolox.core.trainer.Trainer object at 0x0000027189A74828>

  File "d:\e\spore\yolox\yolox\core\trainer.py", line 76, in train
    self.train_in_epoch()
      └ <function Trainer.train_in_epoch at 0x0000027189A68D90>
      └ <yolox.core.trainer.Trainer object at 0x0000027189A74828>

  File "d:\e\spore\yolox\yolox\core\trainer.py", line 85, in train_in_epoch
    self.train_in_iter()
      └ <function Trainer.train_in_iter at 0x0000027189A68E18>
      └ <yolox.core.trainer.Trainer object at 0x0000027189A74828>

  File "d:\e\spore\yolox\yolox\core\trainer.py", line 91, in train_in_iter
    self.train_one_iter()
      └ <function Trainer.train_one_iter at 0x0000027189A68EA0>
      └ <yolox.core.trainer.Trainer object at 0x0000027189A74828>

  File "d:\e\spore\yolox\yolox\core\trainer.py", line 110, in train_one_iter
    self.scaler.scale(loss).backward()
      └ tensor(10.1536, device='cuda:0', grad_fn=<AddBackward0>)
      └ <function GradScaler.scale at 0x00000271FE5FED90>
      └ <torch.cuda.amp.grad_scaler.GradScaler object at 0x0000027189A74860>
      └ <yolox.core.trainer.Trainer object at 0x0000027189A74828>

  File "D:\F\Anaconda3\envs\yolox\lib\site-packages\torch\_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      └ None
      └ False
      └ None
      └ None
      └ tensor([665425.], device='cuda:0', grad_fn=<MulBackward0>)
      └ <function backward at 0x00000271FEACE1E0>
      └ <module 'torch.autograd' from 'D:\F\Anaconda3\envs\yolox\lib\site-packages\torch\autograd\__init__.py'>
      └ <module 'torch' from 'D:\F\Anaconda3\envs\yolox\lib\site-packages\torch\__init__.py'>

  File "D:\F\Anaconda3\envs\yolox\lib\site-packages\torch\autograd\__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 128, 20, 20], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 1, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 00000274B9F17FB0
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 4, 128, 20, 20,
    strideA = 51200, 400, 20, 1,
output: TensorDescriptor 00000274B9F16F80
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 4, 1, 20, 20,
    strideA = 400, 400, 20, 1,
weight: FilterDescriptor 000002747033C530
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 1, 128, 1, 1,
Pointer addresses:
    input: 00000009550AC000
    output: 0000000716DFB000
    weight: 00000007167FFA00

yjcreation avatar Jun 06 '22 08:06 yjcreation

From your log:

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 128, 20, 20], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 1, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

If you can reproduce this error again, please raise an issue in the PyTorch repo. From googling your error log, you could also try setting the device id explicitly (export CUDA_VISIBLE_DEVICES).
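For example, a minimal sketch (assuming the GPU you want is id 0; the variable has to be set before anything initializes CUDA, so it belongs at the very top of the entry script):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # same effect as export CUDA_VISIBLE_DEVICES=0 in the shell

import torch                               # import torch (and anything touching CUDA) only after the variable is set
print(torch.cuda.is_available())           # True if the selected GPU is usable
print(torch.cuda.device_count())           # should report exactly 1 visible device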

FateScript avatar Jun 06 '22 09:06 FateScript

I set import os; os.environ['CUDA_VISIBLE_DEVICES'] = '0', but I still get the same error.

yjcreation avatar Jun 07 '22 12:06 yjcreation

+1

The shared code did not reproduce the error when I ran it separately.

HilmiiKumdakci avatar Jul 18 '22 20:07 HilmiiKumdakci

Removing the -o option solved this problem for me (-o is the --occupy flag, which pre-allocates GPU memory at the start of training).
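That is, launch without the occupy flag, e.g.:

python tools/train.py -f exps/example/yolox_voc/yolox_voc_s.py -d 0 -b 8 --fp16 -c yolox_s.pth

CUDNN_STATUS_INTERNAL_ERROR is often a symptom of running out of GPU memory, so skipping the up-front pre-allocation (or lowering the batch size with -b) may be why dropping -o helps; treat this as a workaround rather than a confirmed root cause.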

x-yy0 avatar Feb 26 '23 01:02 x-yy0