Training yolox-s fails with RuntimeError: CUDA error: device-side assert triggered

Open GuoXu-booo opened this issue 2 years ago • 32 comments

I tried reducing the batch size and input_size, but it made no difference. Training used to work fine; now it fails with:

/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [26,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [27,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [28,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [29,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [30,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [31,0,0] Assertion `input_val >= zero && input_val <= one` failed.
2022-03-03 14:35:41 | ERROR | yolox.models.yolo_head:328 - OOM RuntimeError is raised due to the huge memory cost during label assignment. CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
2022-03-03 14:35:41 | INFO | yolox.core.trainer:196 - Training of experiment is done and the best AP is 0.00
2022-03-03 14:35:41 | ERROR | yolox.core.launch:98 - An error has been caught in function 'launch', process 'MainProcess' (32), thread 'MainThread' (140133772625728):
Traceback (most recent call last):

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 322, in get_losses imgs, └ <unprintable Tensor object>

File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context return func(*args, **kwargs) │ │ └ {} │ └ └ <function YOLOXHead.get_assignments at 0x7f731c7157b8>

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 505, in get_assignments ) = self.dynamic_k_matching(cost, pair_wise_ious, gt_classes, num_gt, fg_mask) │ │ │ │ │ │ └ <unprintable Tensor object> │ │ │ │ │ └ 2 │ │ │ │ └ <unprintable Tensor object> │ │ │ └ <unprintable Tensor object> │ │ └ <unprintable Tensor object> │ └ <function YOLOXHead.dynamic_k_matching at 0x7f731c715950> └ YOLOXHead( (cls_convs): ModuleList( (0): Sequential( (0): BaseConv( (conv): Conv2d(128, 128, kernel_size=...

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 616, in dynamic_k_matching dynamic_ks = dynamic_ks.tolist() │ └ <method 'tolist' of 'torch._C._TensorBase' objects> └ <unprintable Tensor object>

RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "train.py", line 146, in args=(exp, args), │ └ Namespace(batch_size=8, cache=False, ckpt='/project/train/models/weight/yolox_s.pth', devices=0, dist_backend='nccl', dist_ur... └ ╒═══════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════...

File "/project/train/src_repo/YOLOX/tools/../yolox/core/launch.py", line 98, in launch main_func(*args) │ └ (╒═══════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════... └ <function main at 0x7f731c747ae8>

File "train.py", line 124, in main trainer.train() │ └ <function Trainer.train at 0x7f72d48b2bf8> └ <yolox.core.trainer.Trainer object at 0x7f731c7589e8>

File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 74, in train self.train_in_epoch() │ └ <function Trainer.train_in_epoch at 0x7f72d48d3f28> └ <yolox.core.trainer.Trainer object at 0x7f731c7589e8>

File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 83, in train_in_epoch self.train_in_iter() │ └ <function Trainer.train_in_iter at 0x7f731c745950> └ <yolox.core.trainer.Trainer object at 0x7f731c7589e8>

File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 89, in train_in_iter self.train_one_iter() │ └ <function Trainer.train_one_iter at 0x7f731c7459d8> └ <yolox.core.trainer.Trainer object at 0x7f731c7589e8>

File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 103, in train_one_iter outputs = self.model(inps, targets) │ │ │ └ <unprintable Tensor object> │ │ └ <unprintable Tensor object> │ └ YOLOX( │ (backbone): YOLOPAFPN( │ (backbone): CSPDarknet( │ (stem): Focus( │ (conv): BaseConv( │ (conv): ... └ <yolox.core.trainer.Trainer object at 0x7f731c7589e8>

File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) │ │ │ └ {} │ │ └ │ └ <function YOLOX.forward at 0x7f731c715c80> └ YOLOX( (backbone): YOLOPAFPN( (backbone): CSPDarknet( (stem): Focus( (conv): BaseConv( (conv): ...

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolox.py", line 35, in forward fpn_outs, targets, x │ │ └ <unprintable Tensor object> │ └ <unprintable Tensor object> └

File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) │ │ │ └ {} │ │ └ │ └ <function YOLOXHead.forward at 0x7f731c715510> └ YOLOXHead( (cls_convs): ModuleList( (0): Sequential( (0): BaseConv( (conv): Conv2d(128, 128, kernel_size=...

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 203, in forward dtype=xin[0].dtype, └

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 352, in get_losses "cpu",

File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context return func(*args, **kwargs) │ │ └ {} │ └ └ <function YOLOXHead.get_assignments at 0x7f731c7157b8>

File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 446, in get_assignments gt_bboxes_per_image = gt_bboxes_per_image.cpu().float() │ └ <method 'cpu' of 'torch._C._TensorBase' objects> └ <unprintable Tensor object>

RuntimeError: CUDA error: device-side assert triggered

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered

GuoXu-booo avatar Mar 05 '22 03:03 GuoXu-booo
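
For context, the failing check in Loss.cu is the input-range assertion of binary_cross_entropy: the predictions passed to the loss must lie in [0, 1], and NaN/Inf values (for example from an fp16 overflow in the head) violate it. A minimal sketch of how the same assert can be reproduced and surfaced at its real call site (assuming a CUDA build of PyTorch; the tensors here are dummies):

```python
import os
# Make CUDA kernels launch synchronously so the Python traceback points at the real
# failing call instead of a later, unrelated op (set before CUDA is initialized).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
import torch.nn.functional as F

preds = torch.full((4,), float("nan"), device="cuda")  # stand-in for NaN cls_preds
targets = torch.ones(4, device="cuda")

# binary_cross_entropy requires preds in [0, 1]; NaN fails the device-side assertion
# "input_val >= zero && input_val <= one" seen in the log above.
F.binary_cross_entropy(preds, targets)
```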

Why did this error occur?

GuoXu-booo avatar Mar 05 '22 03:03 GuoXu-booo

I have the same issue when training ByteTrack with YOLOX-X

wangyirui avatar Mar 07 '22 15:03 wangyirui

Did you modify any code?

FateScript avatar Mar 08 '22 03:03 FateScript

Did you modify any code?

The code was not modified. Through debugging, I found the problem is in the reg_conv module.

GuoXu-booo avatar Mar 08 '22 03:03 GuoXu-booo

Did you modify any code?

[image]

GuoXu-booo avatar Mar 08 '22 03:03 GuoXu-booo

Did you modify any code?

Is there something wrong with my graphics card? Or with float16? It used to run fine on the server before.

GuoXu-booo avatar Mar 08 '22 03:03 GuoXu-booo

Tesla V100 here; I face the same problem even after trying to change the batch size. It appeared suddenly, with no such error reported before. Besides, I find reg_feat becomes a NaN tensor, like the previous commenter.

ELongking avatar Mar 17 '22 10:03 ELongking
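
Since several people see reg_feat turning into NaN, a quick way to locate the first module that produces a non-finite output is a forward hook. This is a hypothetical helper, not part of YOLOX, and the trainer attribute in the usage note is an assumption:

```python
import torch

def add_nan_hooks(model):
    """Raise as soon as any submodule emits a tensor containing NaN/Inf."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"non-finite output from module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# usage (hypothetical): add_nan_hooks(trainer.model) right before the training loop
```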

Has anyone solved this problem?

Did you modify any code?

Is there something wrong with my graphics card? Or with float16? It used to run fine on the server before.

Have you solved this problem? I have the same problem on a V100, but I don't know the reason.

ilmoney avatar Apr 08 '22 13:04 ilmoney

Tesla V100 here; I face the same problem even after trying to change the batch size. It appeared suddenly, with no such error reported before. Besides, I find reg_feat becomes a NaN tensor, like the previous commenter.

I have the same problem on a V100, but I don't know the reason. Can you provide some solutions? I am at a loss.

ilmoney avatar Apr 08 '22 13:04 ilmoney

Did you modify any code?

I modified my code and then met this problem. Can you provide some solutions? I really don't know the reason.

ilmoney avatar Apr 08 '22 13:04 ilmoney

I face the same issue. Can anyone help me? Log:

/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [50,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [51,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [52,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [53,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [54,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [55,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [56,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [57,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [58,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [59,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [60,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [61,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [62,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
2022-04-13 16:20:52 | ERROR | yolox.models.yolo_head:330 - OOM RuntimeError is raised due to the huge memory cost during label assignment. CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
2022-04-13 16:20:52 | INFO | yolox.core.trainer:189 - Training of experiment is done and the best AP is 0.00
2022-04-13 16:20:52 | ERROR | yolox.core.launch:98 - An error has been caught in function 'launch', process 'MainProcess' (24745), thread 'MainThread' (139972345780032):
Traceback (most recent call last):

File "/home/zyz/Documents/YOLOX/yolox/models/yolo_head.py", line 324, in get_losses imgs, └ <unprintable Tensor object>

File "/home/zyz/.conda/envs/py38/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) │ │ └ {} │ └ └ <function YOLOXHead.get_assignments at 0x7f4d655dc2f0>

File "/home/zyz/Documents/YOLOX/yolox/models/yolo_head.py", line 519, in get_assignments cls_preds_.sqrt_(), gt_cls_per_image, reduction="none" │ │ └ <unprintable Tensor object> │ └ <method 'sqrt_' of 'torch._C._TensorBase' objects> └ <unprintable Tensor object>

File "/home/zyz/.conda/envs/py38/lib/python3.7/site-packages/torch/nn/functional.py", line 2759, in binary_cross_entropy return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum) │ │ │ │ │ │ │ └ 0 │ │ │ │ │ │ └ None │ │ │ │ │ └ <unprintable Tensor object> │ │ │ │ └ <unprintable Tensor object> │ │ │ └ │ │ └ <module 'torch._C._nn'> │ └ <module 'torch._C' from '/home/zyz/.conda/envs/py38/lib/python3.7/site-packages/torch/_C.cpython-37m-x86_64-linux-gnu.so'> └ <module 'torch' from '/home/zyz/.conda/envs/py38/lib/python3.7/site-packages/torch/init.py'>

RuntimeError: CUDA error: device-side assert triggered

@FateScript

bitzyz avatar Apr 13 '22 08:04 bitzyz

Maybe you can check your environment. Is your Python version 3.8?

ilmoney avatar Apr 15 '22 03:04 ilmoney

I'm confused about this problem too

kuazhangxiaoai avatar Apr 15 '22 10:04 kuazhangxiaoai

I ran into the same problem at first, on a Tesla V100. After gradually dropping the batch size from 48 all the way down to 16, the problem has not appeared again. GPU memory should be sufficient, so I don't know why the batch size would cause this error. Could a large batch cause gradient problems under mixed precision? I observed NaN losses in some iterations right before the error.

wangyirui avatar Apr 15 '22 18:04 wangyirui
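
One way to test that hypothesis is to guard the amp training step with a finiteness check, so a bad batch is skipped instead of feeding NaN into the BCE kernel. A rough sketch (the "total_loss" dict key follows the YOLOX trainer; everything else here is illustrative, not the repository's code):

```python
import torch

def train_one_iter(model, optimizer, scaler, inps, targets, use_amp=True):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=use_amp):
        outputs = model(inps, targets)
        loss = outputs["total_loss"]
    if not torch.isfinite(loss):
        # Skip this batch: letting a NaN loss reach binary_cross_entropy on CUDA
        # is what triggers the device-side assert.
        return loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # GradScaler already skips the step on inf/NaN gradients
    scaler.update()
    return loss
```

In the stock tools/train.py, mixed precision is controlled by the --fp16 flag, so leaving that flag out disables the autocast path entirely.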

Setting the learning rate a little lower can solve this problem; it worked for me.

Yuanyang-Zhu avatar Apr 20 '22 08:04 Yuanyang-Zhu
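
If lowering the learning rate helps, the place to do it in YOLOX is the Exp file, since the actual lr is basic_lr_per_img scaled by the total batch size. A sketch with illustrative numbers (the defaults come from yolox_base.py; the 0.33/0.50 scaling matches yolox-s):

```python
from yolox.exp import Exp as BaseExp

class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        self.depth = 0.33              # yolox-s depth/width scaling
        self.width = 0.50
        # Default is 0.01 / 64.0; halving it is one way to test the "lr too large" theory.
        self.basic_lr_per_img = 0.005 / 64.0
        self.warmup_epochs = 5         # keep the warmup so the first iterations stay stable
```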

Setting the learning rate a little lower can solve this problem; it worked for me.

So why might a large lr cause this problem?

ChiefGodMan avatar Jun 04 '22 05:06 ChiefGodMan

I ran into the same problem. How was it solved?

gjd2017 avatar Jun 08 '22 07:06 gjd2017

About this problem: I finally found that my GPU was faulty. You should switch to the CPU to run the YOLOX code, to determine whether there is a problem with your code.

GuoXu-booo avatar Jun 09 '22 03:06 GuoXu-booo

Has anyone fixed it? I met the same problem... I have tried reducing the lr and the batch size, but it is still useless... My GPU device is shown below: [image] I run my code in a Docker container, and I have run other code on this GPU without any problem. Also, I have previously run the same code, with the same batch-size and lr settings, successfully in a similar Docker container.

lmw0320 avatar Jul 07 '22 00:07 lmw0320

Has anyone fixed it? I met the same problem... I have tried reducing the lr and the batch size, but it is still useless... My GPU device is shown below: I run my code in a Docker container, and I have run other code on this GPU without any problem. Also, I have previously run the same code, with the same batch-size and lr settings, successfully in a similar Docker container. [image]

First switch to the CPU to run YOLOX, to confirm that the problem is not in the code.

GuoXu-booo avatar Jul 07 '22 00:07 GuoXu-booo

How do I switch to the CPU? Is there a simple parameter setting for that?

lmw0320 avatar Jul 07 '22 03:07 lmw0320
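
There is no dedicated CPU switch in tools/train.py, but for the sanity check GuoXu-booo suggests, a single forward pass on CPU is usually enough, since device-side asserts turn into readable Python errors there. A rough sketch with a dummy batch (shapes and the [class, cx, cy, w, h] label layout are assumptions based on the training pipeline):

```python
import torch
from yolox.exp import get_exp

exp = get_exp(exp_file=None, exp_name="yolox-s")
model = exp.get_model().train()               # stays on CPU by default

imgs = torch.rand(2, 3, 640, 640)             # dummy images
targets = torch.zeros(2, 50, 5)               # zero-padded labels: [class, cx, cy, w, h]
targets[:, 0] = torch.tensor([0.0, 320, 320, 100, 100])  # one fake box per image

outputs = model(imgs, targets)                # loss computation runs entirely on CPU
print(outputs["total_loss"])
```

Running the real data loader through the same CPU forward pass is the more useful test, since bad labels (boxes outside the image, wrong class indices) are a common cause of this assert.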

I also met this issue, after I added a SIoU loss and used it...

iodncookie avatar Aug 02 '22 06:08 iodncookie

I met the same problem. I can train normally after turning off fp16.

nanhai78 avatar Aug 14 '22 11:08 nanhai78

I met the same problem. I can train normally after turning off fp16.

This worked for me, thank you

lawrencekiba avatar Oct 13 '22 09:10 lawrencekiba

Has anyone solved this problem? I tried everything suggested here: reducing the lr, turning off fp16, reducing the batch size, turning off the cache. I even switched YOLOX back to 0.2.0 and switched the Python version. But nothing worked...

gachiemchiep avatar Oct 14 '22 11:10 gachiemchiep

I met the same problem. I can train normally after turning off fp16.

omg!!!!! Thank you, this is useful!

flyingfish7777 avatar Oct 17 '22 08:10 flyingfish7777

Has anyone solved this problem? I tried everything suggested here: reducing the lr, turning off fp16, reducing the batch size, turning off the cache. I even switched YOLOX back to 0.2.0 and switched the Python version. But nothing worked...

Have you ever tried changing the Focus module to a normal 3×3 convolution?

chairc avatar Nov 07 '22 08:11 chairc
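
For reference, a hypothetical way to try that swap: the Focus stem is a space-to-depth rearrangement followed by a 3×3 conv, so a stride-2 BaseConv keeps the same output stride (base_channels=32 matches yolox-s; adjust for other widths):

```python
from yolox.models.network_blocks import BaseConv

def replace_focus_with_conv(model, base_channels=32, act="silu"):
    # model.backbone is the YOLOPAFPN, model.backbone.backbone the CSPDarknet
    model.backbone.backbone.stem = BaseConv(3, base_channels, ksize=3, stride=2, act=act)
    return model
```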

You could try loading a pretrained model, or changing BN to GN.

nuaaaaa avatar Nov 08 '22 05:11 nuaaaaa
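
A sketch of both suggestions, assuming the usual YOLOX checkpoint layout with the weights stored under the "model" key; the GroupNorm group count is an arbitrary choice and must divide every channel count in the network:

```python
import torch
import torch.nn as nn

def load_pretrained(model, ckpt_path="yolox_s.pth"):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"], strict=False)  # strict=False tolerates a different head
    return model

def bn_to_gn(module, num_groups=16):
    # Recursively replace every BatchNorm2d with a GroupNorm of the same channel count.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            bn_to_gn(child, num_groups)
    return module
```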

I also met this issue, after I added a SIoU loss and used it...

What did you do to fix it?

QAQEthan avatar Dec 14 '22 07:12 QAQEthan

@gachiemchiep Have you solved this problem?

QAQEthan avatar Dec 16 '22 03:12 QAQEthan