Fast-BEV

CUDA error: device-side assert triggered

Rango-Zhang-Hang opened this issue 1 year ago · 6 comments

Thank you for this great work! I followed the instructions and used the full nuScenes v1.0 dataset. But when I run the training code, as I have tried multiple times, it always hits this error at around epoch 1 [14000/20000]. I was using the provided '.pkl' files to train, so I wonder if anyone else has met this problem. I read online that the cause is an inconsistency between the labels and the output, but this error appears during training, not at the very beginning, which is very weird to me.

I attached the report:

/opt/conda/conda-bld/pytorch_1616554790289/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [9,0,0] Assertion `input_val >= zero && input_val <= one` failed.
(the same assertion repeats for threads [10,0,0] through [17,0,0])
Traceback (most recent call last):
  File "tools/train.py", line 279, in <module>
    main()
  File "tools/train.py", line 275, in main
    meta=meta)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/apis/train.py", line 191, in train_model
    meta=meta)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/apis/train.py", line 159, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/detectors/fastbev.py", line 294, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/detectors/fastbev.py", line 312, in forward_train
    loss_det = self.bbox_head.loss(*x, gt_bboxes_3d, gt_labels_3d, img_metas)
  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/dense_heads/free_anchor3d_head.py", line 234, in loss
    positive_losses.append(self.positive_bag_loss(matched_cls_prob, matched_box_prob))
  File "/home_nfs/xxx/hang/mmdetection3d/mmdet3d/models/dense_heads/free_anchor3d_head.py", line 272, in positive_bag_loss
    bag_prob, torch.ones_like(bag_prob), reduction='none')
  File "/home_nfs/xxx/anaconda3/envs/bev-py36/lib/python3.6/site-packages/torch/nn/functional.py", line 2762, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
Aborted (core dumped)
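The assertion `input_val >= zero && input_val <= one` comes from `F.binary_cross_entropy`, which requires every input probability to lie in [0, 1]; an out-of-range value appearing mid-training (e.g. from an fp16 overflow in the head) fires it. A minimal sketch of a generic defensive workaround (not the Fast-BEV authors' fix, and the tensor values are invented for illustration) is to clamp the probabilities just before the BCE call; note that a plain clamp does not repair NaNs, which would need separate handling:

```python
import torch
import torch.nn.functional as F

# F.binary_cross_entropy asserts that each input probability is in [0, 1];
# on CUDA a violation triggers the device-side assert seen in the log above.
eps = 1e-6
bag_prob = torch.tensor([1.2, 0.5, -0.1])           # out-of-range values for illustration
safe_prob = bag_prob.clamp(min=eps, max=1.0 - eps)  # force values strictly inside (0, 1)
loss = F.binary_cross_entropy(safe_prob, torch.ones_like(safe_prob), reduction='none')
print(loss)  # finite per-element losses instead of an assert
```

Running with the environment variable `CUDA_LAUNCH_BLOCKING=1` also helps here: kernel launches become synchronous, so the Python traceback points at the op that actually failed rather than a later one.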

Rango-Zhang-Hang avatar Apr 13 '23 06:04 Rango-Zhang-Hang

I met this error too! Have you resolved it?

Mandylove1993 avatar Apr 20 '23 10:04 Mandylove1993

I met this error too! Have you resolved it?

Sadly no, have you?

Rango-Zhang-Hang avatar Apr 30 '23 05:04 Rango-Zhang-Hang

Have you solved this problem?

silvercherry avatar May 18 '23 11:05 silvercherry

I get the same error.

huichen98 avatar May 22 '23 02:05 huichen98

Comment out fp16. I ran into the same problem. After adding fp16 = dict(loss_scale='dynamic'), this error no longer appears, but during training grad_norm: nan.
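To make a grad_norm of nan like the one reported above actionable, one generic option (a sketch under my own assumptions, not code from this repo) is to check all parameter gradients for non-finite values after backward() and skip the optimizer step when any are found:

```python
import torch

def grads_finite(model: torch.nn.Module) -> bool:
    """Return True when every existing parameter gradient is finite (no NaN/Inf)."""
    return all(torch.isfinite(p.grad).all().item()
               for p in model.parameters() if p.grad is not None)

# Toy usage: a healthy backward pass yields finite gradients.
model = torch.nn.Linear(4, 2)
loss = model(torch.randn(3, 4)).pow(2).mean()
loss.backward()
if grads_finite(model):
    pass  # safe to call optimizer.step(); otherwise skip the step and log the batch
```

Logging which batch produced the non-finite gradients also narrows down whether a specific sample (e.g. a degenerate ground-truth box) is triggering the overflow.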

ycdhqzhiai avatar Mar 29 '24 08:03 ycdhqzhiai

I am having the same problem...

LaCandela avatar Jun 03 '24 08:06 LaCandela