
The matcher occasionally throws an assertion error: assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

Open jeycechen opened this issue 10 months ago • 6 comments

Star RTDETR: please star the RTDETR repository first to support this project and help more people discover it.


Describe the bug Epoch: [24] [3700/4421] eta: 0:04:24 lr: 0.000010 loss: 19.0368 (17.4353) loss_vfl: 0.6064 (0.7137) loss_bbox: 0.0663 (0.1036) loss_giou: 0.5338 (0.5874) loss_vfl_aux_0: 0.7402 (0.8096) loss_bbox_aux_0: 0.0729 (0.1187) loss_giou_aux_0: 0.6301 (0.6288) loss_vfl_aux_1: 0.6860 (0.7895) loss_bbox_aux_1: 0.0646 (0.1089) loss_giou_aux_1: 0.5190 (0.6023) loss_vfl_aux_2: 0.6597 (0.7459) loss_bbox_aux_2: 0.0668 (0.1054) loss_giou_aux_2: 0.5340 (0.5916) loss_vfl_aux_3: 0.6401 (0.7211) loss_bbox_aux_3: 0.0681 (0.1042) loss_giou_aux_3: 0.5312 (0.5888) loss_vfl_aux_4: 0.6299 (0.7149) loss_bbox_aux_4: 0.0665 (0.1037) loss_giou_aux_4: 0.5331 (0.5876) loss_vfl_aux_5: 0.7495 (0.8039) loss_bbox_aux_5: 0.0975 (0.1549) loss_giou_aux_5: 0.7090 (0.7207) loss_vfl_dn_0: 0.5093 (0.5377) loss_bbox_dn_0: 0.0742 (0.1434) loss_giou_dn_0: 0.6288 (0.6582) loss_vfl_dn_1: 0.4673 (0.4842) loss_bbox_dn_1: 0.0617 (0.1143) loss_giou_dn_1: 0.5375 (0.5704) loss_vfl_dn_2: 0.4539 (0.4721) loss_bbox_dn_2: 0.0590 (0.1089) loss_giou_dn_2: 0.5288 (0.5569) loss_vfl_dn_3: 0.4536 (0.4660) loss_bbox_dn_3: 0.0586 (0.1078) loss_giou_dn_3: 0.5254 (0.5546) loss_vfl_dn_4: 0.4468 (0.4654) loss_bbox_dn_4: 0.0587 (0.1077) loss_giou_dn_4: 0.5274 (0.5548) loss_vfl_dn_5: 0.4519 (0.4665) loss_bbox_dn_5: 0.0587 (0.1078) loss_giou_dn_5: 0.5290 (0.5554) time: 0.3496 data: 0.0037 max mem: 17856 Epoch: [24] [3800/4421] eta: 0:03:48 lr: 0.000010 loss: 16.5786 (17.4315) loss_vfl: 0.7236 (0.7142) loss_bbox: 0.0596 (0.1032) loss_giou: 0.3736 (0.5870) loss_vfl_aux_0: 0.7646 (0.8101) loss_bbox_aux_0: 0.0631 (0.1182) loss_giou_aux_0: 0.3962 (0.6285) loss_vfl_aux_1: 0.7769 (0.7899) loss_bbox_aux_1: 0.0598 (0.1085) loss_giou_aux_1: 0.3695 (0.6020) loss_vfl_aux_2: 0.7441 (0.7465) loss_bbox_aux_2: 0.0590 (0.1050) loss_giou_aux_2: 0.3621 (0.5912) loss_vfl_aux_3: 0.7363 (0.7217) loss_bbox_aux_3: 0.0585 (0.1037) loss_giou_aux_3: 0.3685 (0.5884) loss_vfl_aux_4: 0.7446 (0.7157) loss_bbox_aux_4: 0.0583 (0.1033) loss_giou_aux_4: 0.3765 (0.5872) loss_vfl_aux_5: 0.7554 (0.8046) loss_bbox_aux_5: 0.0830 (0.1542) loss_giou_aux_5: 0.4614 (0.7201) loss_vfl_dn_0: 0.5054 (0.5377) loss_bbox_dn_0: 0.0667 (0.1430) loss_giou_dn_0: 0.5499 (0.6583) loss_vfl_dn_1: 0.4458 (0.4842) loss_bbox_dn_1: 0.0603 (0.1139) loss_giou_dn_1: 0.4702 (0.5704) loss_vfl_dn_2: 0.4360 (0.4722) loss_bbox_dn_2: 0.0594 (0.1086) loss_giou_dn_2: 0.4544 (0.5569) loss_vfl_dn_3: 0.4346 (0.4661) loss_bbox_dn_3: 0.0594 (0.1074) loss_giou_dn_3: 0.4574 (0.5545) loss_vfl_dn_4: 0.4363 (0.4655) loss_bbox_dn_4: 0.0593 (0.1074) loss_giou_dn_4: 0.4564 (0.5548) loss_vfl_dn_5: 0.4355 (0.4665) loss_bbox_dn_5: 0.0592 (0.1074) loss_giou_dn_5: 0.4564 (0.5554) time: 0.3734 data: 0.0040 max mem: 17856 Epoch: [24] [3900/4421] eta: 0:03:11 lr: 0.000010 loss: 18.1389 (17.4234) loss_vfl: 0.6440 (0.7142) loss_bbox: 0.0591 (0.1030) loss_giou: 0.5909 (0.5868) loss_vfl_aux_0: 0.6987 (0.8103) loss_bbox_aux_0: 0.0666 (0.1180) loss_giou_aux_0: 0.6362 (0.6282) loss_vfl_aux_1: 0.7231 (0.7900) loss_bbox_aux_1: 0.0646 (0.1082) loss_giou_aux_1: 0.5919 (0.6018) loss_vfl_aux_2: 0.6631 (0.7463) loss_bbox_aux_2: 0.0645 (0.1048) loss_giou_aux_2: 0.5722 (0.5911) loss_vfl_aux_3: 0.6465 (0.7217) loss_bbox_aux_3: 0.0597 (0.1036) loss_giou_aux_3: 0.5741 (0.5882) loss_vfl_aux_4: 0.6460 (0.7159) loss_bbox_aux_4: 0.0596 (0.1031) loss_giou_aux_4: 0.5903 (0.5870) loss_vfl_aux_5: 0.7163 (0.8045) loss_bbox_aux_5: 0.0816 (0.1540) loss_giou_aux_5: 0.6991 (0.7196) loss_vfl_dn_0: 0.5063 (0.5375) loss_bbox_dn_0: 0.0745 (0.1428) loss_giou_dn_0: 0.6293 
(0.6578) loss_vfl_dn_1: 0.4663 (0.4840) loss_bbox_dn_1: 0.0632 (0.1137) loss_giou_dn_1: 0.5524 (0.5699) loss_vfl_dn_2: 0.4607 (0.4720) loss_bbox_dn_2: 0.0550 (0.1083) loss_giou_dn_2: 0.5654 (0.5564) loss_vfl_dn_3: 0.4641 (0.4659) loss_bbox_dn_3: 0.0548 (0.1072) loss_giou_dn_3: 0.5649 (0.5541) loss_vfl_dn_4: 0.4619 (0.4652) loss_bbox_dn_4: 0.0547 (0.1072) loss_giou_dn_4: 0.5631 (0.5543) loss_vfl_dn_5: 0.4612 (0.4663) loss_bbox_dn_5: 0.0547 (0.1072) loss_giou_dn_5: 0.5624 (0.5549) time: 0.3582 data: 0.0037 max mem: 17856

Traceback (most recent call last):
  File "tools/train.py", line 51, in <module>
    main(args)
  File "tools/train.py", line 37, in main
    solver.fit()
  File "/home/amax/ckl/OI-RT-DETR/oi-rtdetr-pytorch/tools/../src/solver/det_solver.py", line 37, in fit
    train_stats = train_one_epoch(
  File "/home/amax/ckl/OI-RT-DETR/oi-rtdetr-pytorch/tools/../src/solver/det_engine.py", line 46, in train_one_epoch
    loss_dict = criterion(outputs, targets)
  File "/home/amax/miniconda3/envs/rtdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/amax/ckl/OI-RT-DETR/oi-rtdetr-pytorch/tools/../src/zoo/rtdetr/rtdetr_criterion.py", line 238, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/home/amax/miniconda3/envs/rtdetr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/amax/miniconda3/envs/rtdetr/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/amax/ckl/OI-RT-DETR/oi-rtdetr-pytorch/tools/../src/zoo/rtdetr/matcher.py", line 99, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/home/amax/ckl/OI-RT-DETR/oi-rtdetr-pytorch/tools/../src/zoo/rtdetr/box_ops.py", line 52, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError

Hello author, sorry to bother you. As the error message shows, the failure seems to occur during the cxcywh-to-xyxy conversion at box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox). But at this point training has already run for 24 epochs, so tgt_bbox should not be failing the conversion; that suggests out_bbox contains a box that violates the xyxy requirement that the bottom-right coordinates be at least the top-left ones. In other words, does the model's predicted cxcywh output contain negative values?

To Reproduce: This is hard to reproduce because the error is intermittent. Sometimes, after training aborts with this assertion at a given epoch, resuming with --resume gets through that same epoch without any problem. For example, after the epoch 24 failure above, I resumed from the epoch 23 checkpoint and epoch 24 then trained through successfully. I don't know how to mitigate or fix this; if the author has time, any suggestions would be much appreciated. Thanks!

jeycechen, Jan 17 '25 03:01
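A quick illustration of the check that fails, assuming the standard DETR-style box_cxcywh_to_xyxy conversion used in box_ops.py: since x1 = cx - w/2 and x2 = cx + w/2, x2 >= x1 holds exactly when w >= 0, so a single predicted box with a negative (or NaN) width or height is enough to trip the assertion.

```python
import torch

def box_cxcywh_to_xyxy(b):
    # Standard DETR-style conversion: (cx, cy, w, h) -> (x1, y1, x2, y2).
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], dim=-1)

bad = torch.tensor([[0.5, 0.5, -0.1, 0.2]])   # one box with a negative width
xyxy = box_cxcywh_to_xyxy(bad)                # tensor([[0.5500, 0.4000, 0.4500, 0.6000]])
print((xyxy[:, 2:] >= xyxy[:, :2]).all())     # tensor(False) -> AssertionError in generalized_box_iou
```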

That's a bit odd. You could add a try and print the actual values, and then clip out_bbox to limit its range.

lyuwenyu, Jan 17 '25 04:01
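A rough sketch of that suggestion, not an official patch: it assumes the code sits in HungarianMatcher.forward right before the GIoU cost shown in the traceback, and that out_bbox holds predicted (cx, cy, w, h) boxes normalized to [0, 1]; the helper name sanitize_out_bbox is made up for illustration.

```python
import torch

def sanitize_out_bbox(out_bbox: torch.Tensor) -> torch.Tensor:
    """Print and repair invalid predictions so cxcywh -> xyxy keeps x2 >= x1 and y2 >= y1."""
    bad = ~torch.isfinite(out_bbox).all(dim=-1) | (out_bbox[:, 2:] < 0).any(dim=-1)
    if bad.any():
        # The print here plays the role of the suggested try/print: dump the offending rows.
        print(f"{int(bad.sum())} invalid predicted boxes:", out_bbox[bad])
    # "Clip out_bbox": replace NaN/Inf and clamp into the valid normalized range.
    out_bbox = torch.nan_to_num(out_bbox, nan=0.0, posinf=1.0, neginf=0.0)
    return out_bbox.clamp(min=0.0, max=1.0)

# Usage inside the matcher, just before the line shown in the traceback:
#   out_bbox = sanitize_out_bbox(out_bbox)
#   cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
```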

Got it, thank you very much!

jeycechen, Jan 17 '25 05:01

That's a bit odd. You could add a try and print the actual values, and then clip out_bbox to limit its range.

Sorry to bother you again. I tried that, and it turns out that for one batch the output is a [300, 4] tensor full of NaN. Do you have any suggestions for this? Many thanks!

jeycechen, Jan 17 '25 06:01
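One generic way to chase down where such NaNs first appear (a debugging aid, not something proposed in this thread) is PyTorch's anomaly detection, combined with a finite-loss guard in the training loop:

```python
import torch

# Slows training noticeably, so enable only while debugging: the backward pass
# will raise an error at the first operation that produces NaN/Inf gradients.
torch.autograd.set_detect_anomaly(True)

# Inside the training loop, a cheap guard can flag the bad batch early
# ("loss" and "targets" stand for whatever names the loop actually uses):
# if not torch.isfinite(loss):
#     print("non-finite loss, inspect this batch:", targets)
#     raise RuntimeError("training diverged")
```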

There is discussion of this here, and many DETR-family models link back to it: https://github.com/facebookresearch/detr/issues/101

There can be many causes for the error.

half-truism, Feb 07 '25 18:02

Hey friend, did you end up solving this?

That's a bit odd. You could add a try and print the actual values, and then clip out_bbox to limit its range.

Sorry to bother you again. I tried that, and it turns out that for one batch the output is a [300, 4] tensor full of NaN. Do you have any suggestions for this? Many thanks!

oooyc, May 11 '25 14:05

For anyone else facing a similar error: I ran into the same issue, and in my case it turned out that my class labels started from 1 instead of 0. Make sure your label indices start from 0, and verify that the number of classes in your labels matches num_classes and the associated parameters in your configuration. Mismatched class indices or counts can lead to unexpected errors during training. A sanity check along these lines is sketched below.

gebawe, May 26 '25 23:05
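A small sanity check along those lines might look like the sketch below; the target layout (COCO-style (image, target) items with a "labels" tensor per image) is an assumption, so adapt the access to your own dataset:

```python
import torch

def check_label_range(dataset, num_classes: int) -> None:
    # Collect every class index in the dataset (assumes (image, target) items
    # where target["labels"] is a 1-D tensor of class ids, as in COCO-style loaders).
    all_labels = torch.cat([target["labels"] for _, target in dataset])
    lo, hi = int(all_labels.min()), int(all_labels.max())
    assert lo >= 0, f"class indices must start at 0, found minimum {lo}"
    assert hi < num_classes, (
        f"class index {hi} is out of range for num_classes={num_classes}; "
        "make sure labels start at 0 and the config matches the dataset"
    )
    print(f"labels OK: range [{lo}, {hi}] with num_classes={num_classes}")
```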