Training yolox-s fails with RuntimeError: CUDA error: device-side assert triggered
I tried reducing the batch size and input_size, with no effect. Training used to work fine before; now it fails with:
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [26,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [27,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [28,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [29,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [30,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [10,0,0], thread: [31,0,0] Assertion `input_val >= zero && input_val <= one` failed.
2022-03-03 14:35:41 | ERROR | yolox.models.yolo_head:328 - OOM RuntimeError is raised due to the huge memory cost during label assignment. CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
2022-03-03 14:35:41 | INFO | yolox.core.trainer:196 - Training of experiment is done and the best AP is 0.00
2022-03-03 14:35:41 | ERROR | yolox.core.launch:98 - An error has been caught in function 'launch', process 'MainProcess' (32), thread 'MainThread' (140133772625728):
Traceback (most recent call last):
  File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 322, in get_losses
    imgs,
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 505, in get_assignments
    ) = self.dynamic_k_matching(cost, pair_wise_ious, gt_classes, num_gt, fg_mask)
  File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 616, in dynamic_k_matching
    dynamic_ks = dynamic_ks.tolist()
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 146, in <module>
  File "/project/train/src_repo/YOLOX/tools/../yolox/core/launch.py", line 98, in launch
    main_func(*args)
  File "train.py", line 124, in main
    trainer.train()
  File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 74, in train
    self.train_in_epoch()
  File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 83, in train_in_epoch
    self.train_in_iter()
  File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 89, in train_in_iter
    self.train_one_iter()
  File "/project/train/src_repo/YOLOX/tools/../yolox/core/trainer.py", line 103, in train_one_iter
    outputs = self.model(inps, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolox.py", line 35, in forward
    fpn_outs, targets, x
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 203, in forward
    dtype=xin[0].dtype,
  File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 352, in get_losses
    "cpu",
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/project/train/src_repo/YOLOX/tools/../yolox/models/yolo_head.py", line 446, in get_assignments
    gt_bboxes_per_image = gt_bboxes_per_image.cpu().float()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Why did this error occur?
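Note that because CUDA kernels launch asynchronously, the traceback above points at the first later synchronization point (the .tolist()/.cpu() calls), not at the kernel that actually tripped the assertion. A generic PyTorch debugging aid (not YOLOX-specific) is to force synchronous launches so the Python stack matches the failing kernel, e.g. at the very top of train.py:

# Standard PyTorch debugging switch; must be set before the CUDA context is
# created (i.e. before any CUDA call), so put it ahead of the torch import.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is set

With that set, rerunning the failing experiment should report the device-side assert at the call that actually produced it.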
I have the same issue when training ByteTrack with YOLOX-X
Did you modify any code?
The code was not modified. Through debugging I found the problem is in the reg_conv module.
Did you modify any code?
Is there something wrong with my graphics card? Or with float16? It used to run fine on the server.
Tesla V100, facing the same problem even after trying to change the batch size. It appeared suddenly and there was no such error before; like the commenter above, I also found reg_feat becoming a NaN tensor.
Has anyone solved this problem?
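Regarding the reg_conv / reg_feat NaN observations above, a generic PyTorch sketch (not part of YOLOX; the helper name is made up) that registers forward hooks to report the first module whose output goes non-finite:

import torch

def add_nan_hooks(model):
    # Raise on the first module that emits NaN/Inf, e.g. a reg_conv block.
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outs:
                if torch.is_tensor(out) and not torch.isfinite(out).all():
                    raise RuntimeError(f"non-finite output in module '{name}'")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

Calling add_nan_hooks(self.model) in the trainer before the loop and running a few iterations narrows down where the NaNs first appear.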
Did you modify any code?
Is there something wrong with my graphics card? Or with float16? It used to run fine on the server.
Have you solved this problem? I met the same problem on a V100, but I don't know the reason.
Tesla V100, facing the same problem even after trying to change the batch size. It appeared suddenly and there was no such error before; like the commenter above, I also found reg_feat becoming a NaN tensor.
I met the same problem on a V100, but I don't know the reason. Can you provide some solutions? I'm at a loss.
Did you modify any code?
I modified my code and met this problem. Can you provide some solutions? I really don't know the reason.
I'm facing the same issue. Can anyone help me?
log:
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [50,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [51,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [52,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [53,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [54,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [55,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [56,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [57,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [58,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [59,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [60,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [61,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [62,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [85,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
2022-04-13 16:20:52 | ERROR | yolox.models.yolo_head:330 - OOM RuntimeError is raised due to the huge memory cost during label assignment. CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
2022-04-13 16:20:52 | INFO | yolox.core.trainer:189 - Training of experiment is done and the best AP is 0.00
2022-04-13 16:20:52 | ERROR | yolox.core.launch:98 - An error has been caught in function 'launch', process 'MainProcess' (24745), thread 'MainThread' (139972345780032):
Traceback (most recent call last):
  File "/home/zyz/Documents/YOLOX/yolox/models/yolo_head.py", line 324, in get_losses
    imgs,
  File "/home/zyz/.conda/envs/py38/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/zyz/Documents/YOLOX/yolox/models/yolo_head.py", line 519, in get_assignments
    cls_preds_.sqrt_(), gt_cls_per_image, reduction="none"
  File "/home/zyz/.conda/envs/py38/lib/python3.7/site-packages/torch/nn/functional.py", line 2759, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
@FateScript
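The traceback above shows the assertion firing inside F.binary_cross_entropy(cls_preds_.sqrt_(), gt_cls_per_image, ...), i.e. cls_preds_ contains values outside [0, 1], typically NaN/Inf once the network has diverged. A minimal plain-PyTorch guard (safe_bce is a hypothetical helper, not YOLOX code) that turns the opaque device-side assert into a readable Python error:

import torch
import torch.nn.functional as F

def safe_bce(pred, target):
    # The device-side assert at Loss.cu:102 means some element of `pred`
    # lies outside [0, 1]; NaN/Inf after an fp16 overflow also fails it.
    if not torch.isfinite(pred).all():
        raise ValueError("BCE input contains NaN/Inf -- the network has diverged")
    if pred.min() < 0 or pred.max() > 1:
        raise ValueError("BCE input falls outside [0, 1]")
    return F.binary_cross_entropy(pred, target, reduction="none")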
Maybe you can check your environment. Is your Python version 3.8?
I'm confused about this problem too
I ran into the same problem at first, on a Tesla V100. After trying batch sizes all the way down from 48 to 16, the problem stopped appearing. GPU memory should have been sufficient, so I don't know why the batch size would cause this error. Could it be that a large batch causes gradient problems under mixed precision? I observed that some iterations had a NaN loss before the error was reported.
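The NaN-loss observation is easy to confirm with a plain finiteness check right after the loss dict comes back from the model. A minimal sketch (check_loss_dict is a hypothetical helper; the dict keys such as "total_loss" are assumed to match the YOLOX trainer in your version):

import torch

def check_loss_dict(outputs, it):
    # Call right after `outputs = self.model(inps, targets)` in train_one_iter.
    for name, value in outputs.items():
        if torch.is_tensor(value) and not torch.isfinite(value).all():
            raise RuntimeError(f"non-finite '{name}' at iter {it}: {value}")

Stopping at the first non-finite iteration, instead of a few iterations later at the CUDA assert, makes it much easier to inspect the batch and the learning-rate/fp16 settings that triggered it.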
Setting the learning rate a little lower can solve this problem. At least it worked for me.
So why might a large learning rate cause this problem?
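Presumably a learning rate that is too large lets the loss diverge, the class predictions become NaN, and the [0, 1] assertion inside binary_cross_entropy then fires. Lowering the rate is done in the experiment file; a hypothetical custom Exp (attribute names assumed from yolox/exp/yolox_base.py, so verify against your checkout):

# exps/example/yolox_s_low_lr.py (hypothetical file name)
from yolox.exp import Exp as BaseExp

class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        self.depth = 0.33            # yolox-s
        self.width = 0.50
        # Default is 0.01 / 64.0; halving it (and/or lengthening warmup) is
        # what the comment above reports as the fix.
        self.basic_lr_per_img = 0.005 / 64.0
        self.warmup_epochs = 5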
I ran into the same problem. How did you solve it?
About this problem: I finally found that my GPU was broken. You should switch to the CPU to run the YOLOX code to determine whether there is a problem with your code.
Has anyone fixed it? I met the same problem. I have tried reducing the lr and the batch size, still no luck...
And this is my GPU device, shown below:
I run my code in a docker container, and I have run other code on this GPU without any problem.
Also, I had previously run the same code, with the same batch-size and lr settings, successfully in a similar docker container.
First switch to the CPU to run YOLOX, to confirm that it is not a problem with the code.
How do I switch to the CPU? Is there a simple parameter setting for that?
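There is no official CPU flag in tools/train.py, but a rough smoke test is to build the model through the exp factory and run one training forward pass on the CPU with dummy data. This is only a sketch: it assumes the standard yolox.exp.get_exp interface, that the loss computation runs on CPU in your YOLOX version, and the (class, cx, cy, w, h) pixel target layout used by yolo_head.get_losses:

import torch
from yolox.exp import get_exp

exp = get_exp(exp_name="yolox-s")
model = exp.get_model().to("cpu").train()

imgs = torch.rand(2, 3, 640, 640)
targets = torch.zeros(2, 50, 5)            # (batch, max_objects, [cls, cx, cy, w, h])
targets[:, 0] = torch.tensor([1.0, 320.0, 320.0, 100.0, 100.0])  # one dummy box per image

outputs = model(imgs, targets)             # dict of losses in training mode
print({k: float(v) for k, v in outputs.items()})

If this runs and the losses are finite, the code path itself is fine and the problem is more likely the GPU, the driver, or fp16.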
I also met this issue, after I added an SIoU loss and used it...
I met the same problem. Then I was able to train normally after turning off fp16.
This worked for me, thank you
Has anyone solved this problem? I tried everything suggested here, such as reducing the lr, turning off fp16, reducing the batch size, and turning off caching. I even switched YOLOX back to 0.2.0 and switched the Python version. But nothing worked...
I met the same problem. Then I was able to train normally after turning off fp16.
omg!!!!! Thank you, very useful!
Has anyone solved this problem? I tried everything suggested here, such as reducing the lr, turning off fp16, reducing the batch size, and turning off caching. I even switched YOLOX back to 0.2.0 and switched the Python version. But nothing worked...
Have you ever tried changing Focus to a normal 3x3 convolution?
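For reference, the stem path below matches the model dump in the first traceback (YOLOX -> YOLOPAFPN -> CSPDarknet -> stem: Focus); the BaseConv argument order (in_channels, out_channels, ksize, stride) is assumed from yolox/models/network_blocks.py, so treat this as a sketch. A stride-2 3x3 conv keeps the stem's 2x downsampling, but pretrained stem weights will no longer apply:

import torch
from yolox.exp import get_exp
from yolox.models.network_blocks import BaseConv

exp = get_exp(exp_name="yolox-s")
model = exp.get_model()

old_stem = model.backbone.backbone.stem            # Focus module
out_ch = old_stem.conv.conv.out_channels           # channels the stem produces
model.backbone.backbone.stem = BaseConv(3, out_ch, 3, 2)   # plain 3x3 conv, stride 2

x = torch.rand(1, 3, 640, 640)
print(model.backbone.backbone.stem(x).shape)       # expect (1, out_ch, 320, 320)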
You could try loading a pretrained model, or changing BN to GN.
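The BN-to-GN part of that suggestion is plain PyTorch; a minimal recursive swap (bn_to_gn is a hypothetical helper, and 32 groups is just a common default):

import torch.nn as nn

def bn_to_gn(module, num_groups=32):
    # Replace every BatchNorm2d with a GroupNorm over the same channel count,
    # copying the affine parameters so the network keeps its scale/shift.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            groups = num_groups if child.num_features % num_groups == 0 else 1
            gn = nn.GroupNorm(groups, child.num_features, affine=True)
            if child.affine:
                gn.weight.data.copy_(child.weight.data)
                gn.bias.data.copy_(child.bias.data)
            setattr(module, name, gn)
        else:
            bn_to_gn(child, num_groups)
    return module

GroupNorm does not track running statistics, so it sidesteps the small-batch instability BatchNorm can have, at some possible cost in accuracy.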
I also met this issue, after I added an SIoU loss and used it...
What did you do to fix it?
@gachiemchiep Have you solved this problem?