mmdetection
[Training is in progress] [Feature] Support RT-DETR
Checklist
- [x] Reproduce with pre-trained weight
- [ ] Reproduce training
- [x] Unit Test
- [ ] Complement Docstring (typehint)
Motivation
Support RT-DETR (https://arxiv.org/abs/2304.08069). Resolves https://github.com/open-mmlab/mmdetection/issues/10186
Consideration
- In RT-DETR, the transformer encoder is applied as the neck.
- In RT-DETR, the bbox heads are used only in the transformer decoder.
Modification
COCO val evaluation:
06/13 11:28:51 - mmengine - INFO - Evaluating bbox...
Loading and preparing results...
DONE (t=7.24s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=140.32s).
Accumulating evaluation results...
DONE (t=37.14s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.531
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.713
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.577
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.348
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.580
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.700
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.723
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.725
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.725
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.550
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.767
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.882
06/13 11:32:08 - mmengine - INFO - bbox_mAP_copypaste: 0.531 0.713 0.577 0.348 0.580 0.700
06/13 11:32:10 - mmengine - INFO - Epoch(test) [5000/5000] coco/bbox_mAP: 0.5310 coco/bbox_mAP_50: 0.7130 coco/bbox_mAP_75: 0.5770 coco/bbox_mAP_s: 0.3480 coco/bbox_mAP_m: 0.5800 coco/bbox_mAP_l: 0.7000 data_time: 0.0048 time: 0.2935
Current Status
Training performance is not reproduced yet.
With the current config, I got the result below. The performance fluctuates after epoch 18.
...
2023/06/14 18:24:42 - mmengine - INFO - Epoch(val) [6][2500/2500] coco/bbox_mAP: 0.4540 coco/bbox_mAP_50: 0.6260 coco/bbox_mAP_75: 0.4910 coco/bbox_mAP_s: 0.2580 coco/bbox_mAP_m: 0.5020 coco/bbox_mAP_l: 0.6400 data_time: 0.0018 time: 0.0309
2023/06/14 23:26:23 - mmengine - INFO - Epoch(val) [12][2500/2500] coco/bbox_mAP: 0.4850 coco/bbox_mAP_50: 0.6610 coco/bbox_mAP_75: 0.5220 coco/bbox_mAP_s: 0.2860 coco/bbox_mAP_m: 0.5310 coco/bbox_mAP_l: 0.6720 data_time: 0.0016 time: 0.0312
2023/06/15 04:30:59 - mmengine - INFO - Epoch(val) [18][2500/2500] coco/bbox_mAP: 0.4960 coco/bbox_mAP_50: 0.6750 coco/bbox_mAP_75: 0.5370 coco/bbox_mAP_s: 0.2950 coco/bbox_mAP_m: 0.5430 coco/bbox_mAP_l: 0.6850 data_time: 0.0018 time: 0.0316
...
2023/06/17 05:27:59 - mmengine - INFO - Epoch(val) [72][2500/2500] coco/bbox_mAP: 0.4960 coco/bbox_mAP_50: 0.6780 coco/bbox_mAP_75: 0.5320 coco/bbox_mAP_s: 0.3070 coco/bbox_mAP_m: 0.5470 coco/bbox_mAP_l: 0.6680 data_time: 0.0015 time: 0.0305
I thought the reason was the difference in transforms in the training dataset, so I edited them as below.
The main differences are some hyperparameters and the order of RandomCrop.
train_pipeline = [
    dict(type='LoadImageFromFile', backend_args={{_base_.backend_args}}),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='PhotoMetricDistortion'),
    dict(
        type='Expand',
        mean=[123.675, 116.28, 103.53],
        to_rgb=True,
        ratio_range=(1, 4)),
    dict(type='RandomCrop', crop_size=(0.3, 1.0), crop_type='relative_range'),
    dict(type='RandomFlip', prob=0.5),
    dict(
        type='RandomChoiceResize',
        scales=[(480, 480), (512, 512), (544, 544), (576, 576),
                (608, 608), (640, 640), (640, 640), (640, 640),
                (672, 672), (704, 704), (736, 736), (768, 768),
                (800, 800)],
        keep_ratio=False),
    dict(type='PackDetInputs')
]
However, it shows slower convergence and a worse result.
2023/06/16 13:03:15 - mmengine - INFO - Epoch(val) [6][2500/2500] coco/bbox_mAP: 0.3670 coco/bbox_mAP_50: 0.5340 coco/bbox_mAP_75: 0.3950 coco/bbox_mAP_s: 0.1730 coco/bbox_mAP_m: 0.4070 coco/bbox_mAP_l: 0.5530 data_time: 0.0019 time: 0.0317
2023/06/16 19:28:04 - mmengine - INFO - Epoch(val) [12][2500/2500] coco/bbox_mAP: 0.4230 coco/bbox_mAP_50: 0.5980 coco/bbox_mAP_75: 0.4560 coco/bbox_mAP_s: 0.2140 coco/bbox_mAP_m: 0.4680 coco/bbox_mAP_l: 0.6230 data_time: 0.0017 time: 0.0315
2023/06/17 01:52:12 - mmengine - INFO - Epoch(val) [18][2500/2500] coco/bbox_mAP: 0.4440 coco/bbox_mAP_50: 0.6220 coco/bbox_mAP_75: 0.4780 coco/bbox_mAP_s: 0.2440 coco/bbox_mAP_m: 0.4890 coco/bbox_mAP_l: 0.6390 data_time: 0.0016 time: 0.0312
2023/06/17 08:18:02 - mmengine - INFO - Epoch(val) [24][2500/2500] coco/bbox_mAP: 0.4580 coco/bbox_mAP_50: 0.6380 coco/bbox_mAP_75: 0.4950 coco/bbox_mAP_s: 0.2500 coco/bbox_mAP_m: 0.5050 coco/bbox_mAP_l: 0.6520 data_time: 0.0021 time: 0.0323
...
06/19 12:47:44 - mmengine - INFO - Epoch(val) [72][2500/2500] coco/bbox_mAP: 0.4900 coco/bbox_mAP_50: 0.6750 coco/bbox_mAP_75: 0.5270 coco/bbox_mAP_s: 0.3030 coco/bbox_mAP_m: 0.5380 coco/bbox_mAP_l: 0.6780 data_time: 0.0017 time: 0.0317
I'm still trying to figure this out.
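To make the fluctuation easier to eyeball, the bbox_mAP values can be scraped out of the mmengine validation lines. A minimal sketch, assuming log lines shaped like the ones above:

```python
import re

# Extract (epoch, bbox_mAP) pairs from mmengine validation log lines.
pattern = re.compile(r'Epoch\(val\) \[(\d+)\].*?coco/bbox_mAP: ([\d.]+)')

def map_by_epoch(log_lines):
    results = {}
    for line in log_lines:
        m = pattern.search(line)
        if m:
            results[int(m.group(1))] = float(m.group(2))
    return results

lines = [
    '2023/06/15 04:30:59 - mmengine - INFO - Epoch(val) [18][2500/2500] '
    'coco/bbox_mAP: 0.4960 coco/bbox_mAP_50: 0.6750',
    '2023/06/17 05:27:59 - mmengine - INFO - Epoch(val) [72][2500/2500] '
    'coco/bbox_mAP: 0.4960 coco/bbox_mAP_50: 0.6780',
]
print(map_by_epoch(lines))  # -> {18: 0.496, 72: 0.496}
```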
You can refer to the RT-DETR reproduced in yolov8; maybe you will find something new.
Unfortunately, it seems the yolov8 maintainers did not reproduce the performance reported in the paper. The default hyperparameters are quite different from the paper's.
from ultralytics import RTDETR
model = RTDETR()
model.info() # display model information
model.train(data="coco.yaml") # train
model.predict("path/to/image.jpg") # predict
log
rt-detr-l summary: 673 layers, 32970476 parameters, 32970476 gradients
Ultralytics YOLOv8.0.120 🚀 Python-3.8.5 torch-1.9.1+cu111 CUDA:0
yolo/engine/trainer: task=detect, mode=train, model=None, data=coco.yaml, epochs=100, patience=50,
batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=None, workers=8, project=None,
name=None, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=False,
single_cls=False, rect=False, cos_lr=False, close_mosaic=0, resume=False, amp=True, fraction=1.0, profile=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False,
conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, vid_stride=1, line_width=None,
visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True,
format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0,
warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0,
nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, v5loader=False, tracker=botsort.yaml,
Now, I'm going to compare it with the one in ppdet, module by module.
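One way to do the module-by-module comparison is to dump each submodule's output from both frameworks (e.g. via forward hooks) to name-to-float-list mappings and diff them. The helper names below are hypothetical, and this sketch shows only the framework-agnostic comparison step:

```python
# Given two dumps {module_name: flat list of output floats}, report the
# largest per-module deviation and whether it is within tolerance, to
# localize where the ported model first diverges from the ppdet one.
def max_abs_diff(a, b):
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)

def compare_dumps(mmdet_out, ppdet_out, atol=1e-3):
    report = {}
    for name in mmdet_out.keys() & ppdet_out.keys():
        d = max_abs_diff(mmdet_out[name], ppdet_out[name])
        report[name] = (d, d <= atol)
    return report

report = compare_dumps({'neck': [0.1, 0.2]}, {'neck': [0.1, 0.2001]})
# report['neck'] is roughly (1e-4, True): outputs match within tolerance
```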
https://github.com/lyuwenyu/RT-DETR @nijkah
@hhaAndroid I came back from holidays. Sorry for the late progress.
It seems that the repository only provides the inference code. I still think the performance difference comes from the training part. The prediction output is almost the same between the migrated model and the original one, as shown below:
[Left: Demo from MMDet, Right: Demo from PaddleDet]
The training loss seems quite different between the ported model and the original one, especially loss_class.
# mmdet 8bs x 2GPU
07/04 08:06:25 - mmengine - INFO - Epoch(train) [1][ 50/7393] base_lr: 2.5488e-06 lr: 2.5488e-06 eta: 3 days, 13:53:13 time: 0.5809 data_time: 0.0409 memory: 14765 grad_norm: 25.0602 loss: 44.3442 loss_cls: 0.2494 loss_bbox: 1.5884 loss_iou: 1.8367 d0.loss_cls: 0.2357 d0.loss_bbox: 1.6084 d0.loss_iou: 1.8593 d1.loss_cls: 0.2374 d1.loss_bbox: 1.6023 d1.loss_iou: 1.8487 d2.loss_cls: 0.2411 d2.loss_bbox: 1.5994 d2.loss_iou: 1.8468 d3.loss_cls: 0.2436 d3.loss_bbox: 1.5953 d3.loss_iou: 1.8439 d4.loss_cls: 0.2500 d4.loss_bbox: 1.5910 d4.loss_iou: 1.8398 enc_loss_cls: 0.2332 enc_loss_bbox: 1.6214 enc_loss_iou: 1.8650 dn_loss_cls: 0.8697 dn_loss_bbox: 0.9152 dn_loss_iou: 1.2991 d0.dn_loss_cls: 0.8708 d0.dn_loss_bbox: 0.9153 d0.dn_loss_iou: 1.2993 d1.dn_loss_cls: 0.8600 d1.dn_loss_bbox: 0.9153 d1.dn_loss_iou: 1.2992 d2.dn_loss_cls: 0.8680 d2.dn_loss_bbox: 0.9153 d2.dn_loss_iou: 1.2992 d3.dn_loss_cls: 0.8730 d3.dn_loss_bbox: 0.9152 d3.dn_loss_iou: 1.2992 d4.dn_loss_cls: 0.8789 d4.dn_loss_bbox: 0.9152 d4.dn_loss_iou: 1.2992
07/04 08:06:52 - mmengine - INFO - Epoch(train) [1][ 100/7393] base_lr: 5.0475e-06 lr: 5.0475e-06 eta: 3 days, 12:12:11 time: 0.5582 data_time: 0.0150 memory: 14765 grad_norm: 26.7655 loss: 43.8603 loss_cls: 0.2484 loss_bbox: 1.5579 loss_iou: 1.8946 d0.loss_cls: 0.2508 d0.loss_bbox: 1.5876 d0.loss_iou: 1.9174 d1.loss_cls: 0.2440 d1.loss_bbox: 1.5792 d1.loss_iou: 1.9120 d2.loss_cls: 0.2436 d2.loss_bbox: 1.5732 d2.loss_iou: 1.9078 d3.loss_cls: 0.2440 d3.loss_bbox: 1.5698 d3.loss_iou: 1.9029 d4.loss_cls: 0.2459 d4.loss_bbox: 1.5625 d4.loss_iou: 1.9000 enc_loss_cls: 0.2519 enc_loss_bbox: 1.5984 enc_loss_iou: 1.9295 dn_loss_cls: 0.7495 dn_loss_bbox: 0.8637 dn_loss_iou: 1.3038 d0.dn_loss_cls: 0.8425 d0.dn_loss_bbox: 0.8633 d0.dn_loss_iou: 1.3045 d1.dn_loss_cls: 0.8067 d1.dn_loss_bbox: 0.8633 d1.dn_loss_iou: 1.3044 d2.dn_loss_cls: 0.7926 d2.dn_loss_bbox: 0.8634 d2.dn_loss_iou: 1.3042 d3.dn_loss_cls: 0.7785 d3.dn_loss_bbox: 0.8635 d3.dn_loss_iou: 1.3041 d4.dn_loss_cls: 0.7633 d4.dn_loss_bbox: 0.8636 d4.dn_loss_iou: 1.3039
07/04 08:07:20 - mmengine - INFO - Epoch(train) [1][ 150/7393] base_lr: 7.5463e-06 lr: 7.5463e-06 eta: 3 days, 11:36:14 time: 0.5576 data_time: 0.0148 memory: 14765 grad_norm: 24.9964 loss: 44.6943 loss_cls: 0.2648 loss_bbox: 1.6522 loss_iou: 1.9208 d0.loss_cls: 0.2474 d0.loss_bbox: 1.6940 d0.loss_iou: 1.9675 d1.loss_cls: 0.2439 d1.loss_bbox: 1.6920 d1.loss_iou: 1.9536 d2.loss_cls: 0.2485 d2.loss_bbox: 1.6766 d2.loss_iou: 1.9466 d3.loss_cls: 0.2559 d3.loss_bbox: 1.6665 d3.loss_iou: 1.9365 d4.loss_cls: 0.2575 d4.loss_bbox: 1.6597 d4.loss_iou: 1.9299 enc_loss_cls: 0.2494 enc_loss_bbox: 1.7102 enc_loss_iou: 1.9786 dn_loss_cls: 0.6944 dn_loss_bbox: 0.8987 dn_loss_iou: 1.2998 d0.dn_loss_cls: 0.8105 d0.dn_loss_bbox: 0.8977 d0.dn_loss_iou: 1.3003 d1.dn_loss_cls: 0.7438 d1.dn_loss_bbox: 0.8976 d1.dn_loss_iou: 1.3000 d2.dn_loss_cls: 0.7134 d2.dn_loss_bbox: 0.8977 d2.dn_loss_iou: 1.2998 d3.dn_loss_cls: 0.7034 d3.dn_loss_bbox: 0.8979 d3.dn_loss_iou: 1.2997 d4.dn_loss_cls: 0.6896 d4.dn_loss_bbox: 0.8982 d4.dn_loss_iou: 1.2996
07/04 08:07:49 - mmengine - INFO - Epoch(train) [1][ 200/7393] base_lr: 1.0045e-05 lr: 1.0045e-05 eta: 3 days, 11:41:22 time: 0.5681 data_time: 0.0172 memory: 14765 grad_norm: 31.5467 loss: 43.6378 loss_cls: 0.2767 loss_bbox: 1.5410 loss_iou: 1.8848 d0.loss_cls: 0.2576 d0.loss_bbox: 1.6095 d0.loss_iou: 1.9528 d1.loss_cls: 0.2533 d1.loss_bbox: 1.5953 d1.loss_iou: 1.9377 d2.loss_cls: 0.2638 d2.loss_bbox: 1.5764 d2.loss_iou: 1.9223 d3.loss_cls: 0.2662 d3.loss_bbox: 1.5630 d3.loss_iou: 1.9058 d4.loss_cls: 0.2720 d4.loss_bbox: 1.5519 d4.loss_iou: 1.8941 enc_loss_cls: 0.2644 enc_loss_bbox: 1.6327 enc_loss_iou: 1.9707 dn_loss_cls: 0.6153 dn_loss_bbox: 0.8752 dn_loss_iou: 1.3542 d0.dn_loss_cls: 0.7631 d0.dn_loss_bbox: 0.8617 d0.dn_loss_iou: 1.3491 d1.dn_loss_cls: 0.6712 d1.dn_loss_bbox: 0.8626 d1.dn_loss_iou: 1.3489 d2.dn_loss_cls: 0.6450 d2.dn_loss_bbox: 0.8645 d2.dn_loss_iou: 1.3492 d3.dn_loss_cls: 0.6290 d3.dn_loss_bbox: 0.8674 d3.dn_loss_iou: 1.3502 d4.dn_loss_cls: 0.6166 d4.dn_loss_bbox: 0.8711 d4.dn_loss_iou: 1.3518
07/04 08:08:17 - mmengine - INFO - Epoch(train) [1][ 250/7393] base_lr: 1.2544e-05 lr: 1.2544e-05 eta: 3 days, 11:21:57 time: 0.5555 data_time: 0.0152 memory: 14765 grad_norm: 44.9920 loss: 42.1280 loss_cls: 0.3194 loss_bbox: 1.4485 loss_iou: 1.7340 d0.loss_cls: 0.2736 d0.loss_bbox: 1.5596 d0.loss_iou: 1.8383 d1.loss_cls: 0.2846 d1.loss_bbox: 1.5293 d1.loss_iou: 1.8090 d2.loss_cls: 0.2936 d2.loss_bbox: 1.5019 d2.loss_iou: 1.7792 d3.loss_cls: 0.3022 d3.loss_bbox: 1.4815 d3.loss_iou: 1.7609 d4.loss_cls: 0.3108 d4.loss_bbox: 1.4662 d4.loss_iou: 1.7443 enc_loss_cls: 0.2809 enc_loss_bbox: 1.5873 enc_loss_iou: 1.8681 dn_loss_cls: 0.5758 dn_loss_bbox: 0.9277 dn_loss_iou: 1.3218 d0.dn_loss_cls: 0.7110 d0.dn_loss_bbox: 0.8735 d0.dn_loss_iou: 1.3049 d1.dn_loss_cls: 0.6347 d1.dn_loss_bbox: 0.8782 d1.dn_loss_iou: 1.3047 d2.dn_loss_cls: 0.6116 d2.dn_loss_bbox: 0.8877 d2.dn_loss_iou: 1.3067 d3.dn_loss_cls: 0.5948 d3.dn_loss_bbox: 0.8998 d3.dn_loss_iou: 1.3106 d4.dn_loss_cls: 0.5818 d4.dn_loss_bbox: 0.9138 d4.dn_loss_iou: 1.3159
07/04 08:08:44 - mmengine - INFO - Epoch(train) [1][ 300/7393] base_lr: 1.5043e-05 lr: 1.5043e-05 eta: 3 days, 10:57:54 time: 0.5481 data_time: 0.0157 memory: 14765 grad_norm: 60.9143 loss: 42.8167 loss_cls: 0.3396 loss_bbox: 1.4390 loss_iou: 1.7847 d0.loss_cls: 0.2928 d0.loss_bbox: 1.5773 d0.loss_iou: 1.9057 d1.loss_cls: 0.3062 d1.loss_bbox: 1.5216 d1.loss_iou: 1.8683 d2.loss_cls: 0.3159 d2.loss_bbox: 1.4885 d2.loss_iou: 1.8307 d3.loss_cls: 0.3284 d3.loss_bbox: 1.4630 d3.loss_iou: 1.8087 d4.loss_cls: 0.3335 d4.loss_bbox: 1.4500 d4.loss_iou: 1.7958 enc_loss_cls: 0.2918 enc_loss_bbox: 1.6272 enc_loss_iou: 1.9487 dn_loss_cls: 0.5238 dn_loss_bbox: 0.9650 dn_loss_iou: 1.3762 d0.dn_loss_cls: 0.6764 d0.dn_loss_bbox: 0.8680 d0.dn_loss_iou: 1.3426 d1.dn_loss_cls: 0.6009 d1.dn_loss_bbox: 0.8842 d1.dn_loss_iou: 1.3468 d2.dn_loss_cls: 0.5667 d2.dn_loss_bbox: 0.9087 d2.dn_loss_iou: 1.3543 d3.dn_loss_cls: 0.5454 d3.dn_loss_bbox: 0.9308 d3.dn_loss_iou: 1.3615 d4.dn_loss_cls: 0.5290 d4.dn_loss_bbox: 0.9492 d4.dn_loss_iou: 1.3695
07/04 08:09:11 - mmengine - INFO - Epoch(train) [1][ 350/7393] base_lr: 1.7541e-05 lr: 1.7541e-05 eta: 3 days, 10:43:19 time: 0.5503 data_time: 0.0154 memory: 14765 grad_norm: 61.2643 loss: 40.4907 loss_cls: 0.3642 loss_bbox: 1.2941 loss_iou: 1.5588 d0.loss_cls: 0.3258 d0.loss_bbox: 1.3989 d0.loss_iou: 1.6473 d1.loss_cls: 0.3411 d1.loss_bbox: 1.3493 d1.loss_iou: 1.6065 d2.loss_cls: 0.3508 d2.loss_bbox: 1.3193 d2.loss_iou: 1.5840 d3.loss_cls: 0.3579 d3.loss_bbox: 1.3090 d3.loss_iou: 1.5727 d4.loss_cls: 0.3606 d4.loss_bbox: 1.2991 d4.loss_iou: 1.5673 enc_loss_cls: 0.3007 enc_loss_bbox: 1.4747 enc_loss_iou: 1.7077 dn_loss_cls: 0.5109 dn_loss_bbox: 1.0116 dn_loss_iou: 1.3840 d0.dn_loss_cls: 0.6636 d0.dn_loss_bbox: 0.9152 d0.dn_loss_iou: 1.3475 d1.dn_loss_cls: 0.5838 d1.dn_loss_bbox: 0.9456 d1.dn_loss_iou: 1.3578 d2.dn_loss_cls: 0.5458 d2.dn_loss_bbox: 0.9740 d2.dn_loss_iou: 1.3692 d3.dn_loss_cls: 0.5266 d3.dn_loss_bbox: 0.9921 d3.dn_loss_iou: 1.3761 d4.dn_loss_cls: 0.5128 d4.dn_loss_bbox: 1.0034 d4.dn_loss_iou: 1.3807
07/04 08:09:39 - mmengine - INFO - Epoch(train) [1][ 400/7393] base_lr: 2.0040e-05 lr: 2.0040e-05 eta: 3 days, 10:31:04 time: 0.5492 data_time: 0.0151 memory: 14765 grad_norm: 33.5409 loss: 40.0292 loss_cls: 0.4119 loss_bbox: 1.2533 loss_iou: 1.6856 d0.loss_cls: 0.4111 d0.loss_bbox: 1.3096 d0.loss_iou: 1.7348 d1.loss_cls: 0.4076 d1.loss_bbox: 1.2832 d1.loss_iou: 1.7126 d2.loss_cls: 0.4102 d2.loss_bbox: 1.2721 d2.loss_iou: 1.7005 d3.loss_cls: 0.4077 d3.loss_bbox: 1.2651 d3.loss_iou: 1.6928 d4.loss_cls: 0.4102 d4.loss_bbox: 1.2612 d4.loss_iou: 1.6903 enc_loss_cls: 0.4030 enc_loss_bbox: 1.3718 enc_loss_iou: 1.7791 dn_loss_cls: 0.4826 dn_loss_bbox: 0.8943 dn_loss_iou: 1.3057 d0.dn_loss_cls: 0.6012 d0.dn_loss_bbox: 0.8433 d0.dn_loss_iou: 1.2954 d1.dn_loss_cls: 0.5268 d1.dn_loss_bbox: 0.8680 d1.dn_loss_iou: 1.3004 d2.dn_loss_cls: 0.4945 d2.dn_loss_bbox: 0.8828 d2.dn_loss_iou: 1.3042 d3.dn_loss_cls: 0.4828 d3.dn_loss_bbox: 0.8900 d3.dn_loss_iou: 1.3055 d4.dn_loss_cls: 0.4786 d4.dn_loss_bbox: 0.8933 d4.dn_loss_iou: 1.3059
07/04 08:10:06 - mmengine - INFO - Epoch(train) [1][ 450/7393] base_lr: 2.2539e-05 lr: 2.2539e-05 eta: 3 days, 10:23:02 time: 0.5508 data_time: 0.0157 memory: 14765 grad_norm: 35.2407 loss: 38.5235 loss_cls: 0.4331 loss_bbox: 1.1496 loss_iou: 1.6211 d0.loss_cls: 0.4369 d0.loss_bbox: 1.1949 d0.loss_iou: 1.6464 d1.loss_cls: 0.4298 d1.loss_bbox: 1.1862 d1.loss_iou: 1.6364 d2.loss_cls: 0.4318 d2.loss_bbox: 1.1758 d2.loss_iou: 1.6313 d3.loss_cls: 0.4312 d3.loss_bbox: 1.1659 d3.loss_iou: 1.6291 d4.loss_cls: 0.4283 d4.loss_bbox: 1.1588 d4.loss_iou: 1.6282 enc_loss_cls: 0.4427 enc_loss_bbox: 1.2188 enc_loss_iou: 1.6747 dn_loss_cls: 0.4748 dn_loss_bbox: 0.8346 dn_loss_iou: 1.2997 d0.dn_loss_cls: 0.5773 d0.dn_loss_bbox: 0.8328 d0.dn_loss_iou: 1.2894 d1.dn_loss_cls: 0.5104 d1.dn_loss_bbox: 0.8386 d1.dn_loss_iou: 1.2899 d2.dn_loss_cls: 0.4853 d2.dn_loss_bbox: 0.8405 d2.dn_loss_iou: 1.2908 d3.dn_loss_cls: 0.4742 d3.dn_loss_bbox: 0.8398 d3.dn_loss_iou: 1.2923 d4.dn_loss_cls: 0.4703 d4.dn_loss_bbox: 0.8376 d4.dn_loss_iou: 1.2944
07/04 08:10:34 - mmengine - INFO - Epoch(train) [1][ 500/7393] base_lr: 2.5038e-05 lr: 2.5038e-05 eta: 3 days, 10:10:42 time: 0.5443 data_time: 0.0147 memory: 14765 grad_norm: 37.2461 loss: 38.9175 loss_cls: 0.4576 loss_bbox: 1.1115 loss_iou: 1.5468 d0.loss_cls: 0.4658 d0.loss_bbox: 1.1688 d0.loss_iou: 1.5733 d1.loss_cls: 0.4631 d1.loss_bbox: 1.1504 d1.loss_iou: 1.5595 d2.loss_cls: 0.4593 d2.loss_bbox: 1.1382 d2.loss_iou: 1.5545 d3.loss_cls: 0.4599 d3.loss_bbox: 1.1282 d3.loss_iou: 1.5493 d4.loss_cls: 0.4554 d4.loss_bbox: 1.1210 d4.loss_iou: 1.5480 enc_loss_cls: 0.4715 enc_loss_bbox: 1.1972 enc_loss_iou: 1.5967 dn_loss_cls: 0.4749 dn_loss_bbox: 0.9303 dn_loss_iou: 1.3639 d0.dn_loss_cls: 0.5827 d0.dn_loss_bbox: 0.9371 d0.dn_loss_iou: 1.3517 d1.dn_loss_cls: 0.5124 d1.dn_loss_bbox: 0.9391 d1.dn_loss_iou: 1.3514 d2.dn_loss_cls: 0.4858 d2.dn_loss_bbox: 0.9377 d2.dn_loss_iou: 1.3518 d3.dn_loss_cls: 0.4741 d3.dn_loss_bbox: 0.9345 d3.dn_loss_iou: 1.3539 d4.dn_loss_cls: 0.4702 d4.dn_loss_bbox: 0.9320 d4.dn_loss_iou: 1.3578
07/04 08:11:01 - mmengine - INFO - Epoch(train) [1][ 550/7393] base_lr: 2.7536e-05 lr: 2.7536e-05 eta: 3 days, 10:00:12 time: 0.5439 data_time: 0.0147 memory: 14758 grad_norm: 36.4342 loss: 36.6310 loss_cls: 0.4611 loss_bbox: 1.0025 loss_iou: 1.4896 d0.loss_cls: 0.4715 d0.loss_bbox: 1.0635 d0.loss_iou: 1.5199 d1.loss_cls: 0.4629 d1.loss_bbox: 1.0497 d1.loss_iou: 1.5036 d2.loss_cls: 0.4585 d2.loss_bbox: 1.0311 d2.loss_iou: 1.5001 d3.loss_cls: 0.4554 d3.loss_bbox: 1.0175 d3.loss_iou: 1.4960 d4.loss_cls: 0.4567 d4.loss_bbox: 1.0077 d4.loss_iou: 1.4940 enc_loss_cls: 0.4967 enc_loss_bbox: 1.0838 enc_loss_iou: 1.5300 dn_loss_cls: 0.4302 dn_loss_bbox: 0.8524 dn_loss_iou: 1.2964 d0.dn_loss_cls: 0.5308 d0.dn_loss_bbox: 0.8607 d0.dn_loss_iou: 1.2806 d1.dn_loss_cls: 0.4662 d1.dn_loss_bbox: 0.8604 d1.dn_loss_iou: 1.2808 d2.dn_loss_cls: 0.4389 d2.dn_loss_bbox: 0.8570 d2.dn_loss_iou: 1.2840 d3.dn_loss_cls: 0.4270 d3.dn_loss_bbox: 0.8542 d3.dn_loss_iou: 1.2887 d4.dn_loss_cls: 0.4255 d4.dn_loss_bbox: 0.8530 d4.dn_loss_iou: 1.2925
07/04 08:11:28 - mmengine - INFO - Epoch(train) [1][ 600/7393] base_lr: 3.0035e-05 lr: 3.0035e-05 eta: 3 days, 9:55:35 time: 0.5495 data_time: 0.0150 memory: 14765 grad_norm: 45.7821 loss: 35.4919 loss_cls: 0.4403 loss_bbox: 0.9375 loss_iou: 1.5234 d0.loss_cls: 0.4348 d0.loss_bbox: 1.0059 d0.loss_iou: 1.5596 d1.loss_cls: 0.4300 d1.loss_bbox: 0.9801 d1.loss_iou: 1.5467 d2.loss_cls: 0.4333 d2.loss_bbox: 0.9603 d2.loss_iou: 1.5374 d3.loss_cls: 0.4344 d3.loss_bbox: 0.9509 d3.loss_iou: 1.5301 d4.loss_cls: 0.4367 d4.loss_bbox: 0.9454 d4.loss_iou: 1.5264 enc_loss_cls: 0.4499 enc_loss_bbox: 1.0371 enc_loss_iou: 1.5808 dn_loss_cls: 0.3975 dn_loss_bbox: 0.7696 dn_loss_iou: 1.2939 d0.dn_loss_cls: 0.4900 d0.dn_loss_bbox: 0.7727 d0.dn_loss_iou: 1.2678 d1.dn_loss_cls: 0.4287 d1.dn_loss_bbox: 0.7711 d1.dn_loss_iou: 1.2699 d2.dn_loss_cls: 0.4052 d2.dn_loss_bbox: 0.7693 d2.dn_loss_iou: 1.2763 d3.dn_loss_cls: 0.3947 d3.dn_loss_bbox: 0.7690 d3.dn_loss_iou: 1.2838 d4.dn_loss_cls: 0.3937 d4.dn_loss_bbox: 0.7690 d4.dn_loss_iou: 1.2887
07/04 08:11:56 - mmengine - INFO - Epoch(train) [1][ 650/7393] base_lr: 3.2534e-05 lr: 3.2534e-05 eta: 3 days, 9:52:00 time: 0.5501 data_time: 0.0147 memory: 14765 grad_norm: 41.7852 loss: 36.1583 loss_cls: 0.4479 loss_bbox: 0.8858 loss_iou: 1.4241 d0.loss_cls: 0.4368 d0.loss_bbox: 0.9667 d0.loss_iou: 1.4606 d1.loss_cls: 0.4383 d1.loss_bbox: 0.9283 d1.loss_iou: 1.4457 d2.loss_cls: 0.4392 d2.loss_bbox: 0.9078 d2.loss_iou: 1.4402 d3.loss_cls: 0.4414 d3.loss_bbox: 0.8961 d3.loss_iou: 1.4330 d4.loss_cls: 0.4448 d4.loss_bbox: 0.8903 d4.loss_iou: 1.4293 enc_loss_cls: 0.4472 enc_loss_bbox: 1.0002 enc_loss_iou: 1.4818 dn_loss_cls: 0.4390 dn_loss_bbox: 0.9109 dn_loss_iou: 1.3878 d0.dn_loss_cls: 0.5305 d0.dn_loss_bbox: 0.9128 d0.dn_loss_iou: 1.3656 d1.dn_loss_cls: 0.4638 d1.dn_loss_bbox: 0.9099 d1.dn_loss_iou: 1.3703 d2.dn_loss_cls: 0.4431 d2.dn_loss_bbox: 0.9092 d2.dn_loss_iou: 1.3773 d3.dn_loss_cls: 0.4326 d3.dn_loss_bbox: 0.9097 d3.dn_loss_iou: 1.3823 d4.dn_loss_cls: 0.4329 d4.dn_loss_bbox: 0.9099 d4.dn_loss_iou: 1.3852
07/04 08:12:23 - mmengine - INFO - Epoch(train) [1][ 700/7393] base_lr: 3.5033e-05 lr: 3.5033e-05 eta: 3 days, 9:46:36 time: 0.5465 data_time: 0.0154 memory: 14765 grad_norm: 44.8541 loss: 37.7555 loss_cls: 0.5058 loss_bbox: 0.9555 loss_iou: 1.5328 d0.loss_cls: 0.4985 d0.loss_bbox: 1.0394 d0.loss_iou: 1.5688 d1.loss_cls: 0.4966 d1.loss_bbox: 0.9987 d1.loss_iou: 1.5547 d2.loss_cls: 0.5005 d2.loss_bbox: 0.9758 d2.loss_iou: 1.5454 d3.loss_cls: 0.4992 d3.loss_bbox: 0.9658 d3.loss_iou: 1.5414 d4.loss_cls: 0.5030 d4.loss_bbox: 0.9641 d4.loss_iou: 1.5358 enc_loss_cls: 0.5167 enc_loss_bbox: 1.0803 enc_loss_iou: 1.5922 dn_loss_cls: 0.4169 dn_loss_bbox: 0.9093 dn_loss_iou: 1.4012 d0.dn_loss_cls: 0.5026 d0.dn_loss_bbox: 0.9080 d0.dn_loss_iou: 1.3735 d1.dn_loss_cls: 0.4408 d1.dn_loss_bbox: 0.9062 d1.dn_loss_iou: 1.3805 d2.dn_loss_cls: 0.4196 d2.dn_loss_bbox: 0.9070 d2.dn_loss_iou: 1.3888 d3.dn_loss_cls: 0.4107 d3.dn_loss_bbox: 0.9079 d3.dn_loss_iou: 1.3946 d4.dn_loss_cls: 0.4106 d4.dn_loss_bbox: 0.9083 d4.dn_loss_iou: 1.3980
07/04 08:12:50 - mmengine - INFO - Epoch(train) [1][ 750/7393] base_lr: 3.7531e-05 lr: 3.7531e-05 eta: 3 days, 9:40:59 time: 0.5451 data_time: 0.0148 memory: 14765 grad_norm: 45.8713 loss: 37.4994 loss_cls: 0.5150 loss_bbox: 0.9276 loss_iou: 1.5517 d0.loss_cls: 0.5015 d0.loss_bbox: 1.0159 d0.loss_iou: 1.5910 d1.loss_cls: 0.5114 d1.loss_bbox: 0.9666 d1.loss_iou: 1.5669 d2.loss_cls: 0.5097 d2.loss_bbox: 0.9458 d2.loss_iou: 1.5598 d3.loss_cls: 0.5093 d3.loss_bbox: 0.9361 d3.loss_iou: 1.5568 d4.loss_cls: 0.5111 d4.loss_bbox: 0.9320 d4.loss_iou: 1.5540 enc_loss_cls: 0.5083 enc_loss_bbox: 1.0715 enc_loss_iou: 1.6195 dn_loss_cls: 0.4075 dn_loss_bbox: 0.8807 dn_loss_iou: 1.4021 d0.dn_loss_cls: 0.4841 d0.dn_loss_bbox: 0.8750 d0.dn_loss_iou: 1.3766 d1.dn_loss_cls: 0.4240 d1.dn_loss_bbox: 0.8738 d1.dn_loss_iou: 1.3849 d2.dn_loss_cls: 0.4053 d2.dn_loss_bbox: 0.8765 d2.dn_loss_iou: 1.3922 d3.dn_loss_cls: 0.3988 d3.dn_loss_bbox: 0.8783 d3.dn_loss_iou: 1.3974 d4.dn_loss_cls: 0.4011 d4.dn_loss_bbox: 0.8793 d4.dn_loss_iou: 1.3999
07/04 08:13:18 - mmengine - INFO - Epoch(train) [1][ 800/7393] base_lr: 4.0030e-05 lr: 4.0030e-05 eta: 3 days, 9:33:46 time: 0.5410 data_time: 0.0151 memory: 14765 grad_norm: 45.5470 loss: 36.6604 loss_cls: 0.4964 loss_bbox: 0.9202 loss_iou: 1.5986 d0.loss_cls: 0.4949 d0.loss_bbox: 1.0054 d0.loss_iou: 1.6342 d1.loss_cls: 0.4981 d1.loss_bbox: 0.9539 d1.loss_iou: 1.6156 d2.loss_cls: 0.4967 d2.loss_bbox: 0.9362 d2.loss_iou: 1.6053 d3.loss_cls: 0.4918 d3.loss_bbox: 0.9326 d3.loss_iou: 1.6021 d4.loss_cls: 0.4931 d4.loss_bbox: 0.9266 d4.loss_iou: 1.6014 enc_loss_cls: 0.5155 enc_loss_bbox: 1.0780 enc_loss_iou: 1.6599 dn_loss_cls: 0.3780 dn_loss_bbox: 0.8335 dn_loss_iou: 1.3062 d0.dn_loss_cls: 0.4438 d0.dn_loss_bbox: 0.8213 d0.dn_loss_iou: 1.2897 d1.dn_loss_cls: 0.3920 d1.dn_loss_bbox: 0.8235 d1.dn_loss_iou: 1.2961 d2.dn_loss_cls: 0.3763 d2.dn_loss_bbox: 0.8283 d2.dn_loss_iou: 1.3011 d3.dn_loss_cls: 0.3707 d3.dn_loss_bbox: 0.8302 d3.dn_loss_iou: 1.3037 d4.dn_loss_cls: 0.3725 d4.dn_loss_bbox: 0.8317 d4.dn_loss_iou: 1.3049
# ppdet 8bs x 2GPU
[07/04 07:42:40] ppdet.engine INFO: Epoch: [0] [ 0/7329] learning_rate: 0.000000 loss_class: 0.683346 loss_bbox: 1.285977 loss_giou: 0.964165 loss_class_aux: 1.870583 loss_bbox_aux: 7.731508 loss_giou_aux: 5.904184 loss_class_dn: 0.686849 loss_bbox_dn: 0.716623 loss_giou_dn: 0.833695 loss_class_aux_dn: 2.812277 loss_bbox_aux_dn: 3.583115 loss_giou_aux_dn: 4.168473 loss: 31.240795 eta: 21 days, 6:44:52 batch_cost: 3.4844 data_cost: 0.0005 ips: 2.2959 images/s
[07/04 07:43:03] ppdet.engine INFO: Epoch: [0] [ 50/7329] learning_rate: 0.000003 loss_class: 0.998480 loss_bbox: 1.382280 loss_giou: 1.748477 loss_class_aux: 1.655656 loss_bbox_aux: 8.407383 loss_giou_aux: 10.586290 loss_class_dn: 1.017919 loss_bbox_dn: 0.802868 loss_giou_dn: 1.311996 loss_class_aux_dn: 4.180416 loss_bbox_aux_dn: 4.014454 loss_giou_aux_dn: 6.561262 loss: 43.388004 eta: 2 days, 21:29:28 batch_cost: 0.4139 data_cost: 0.0004 ips: 19.3273 images/s
[07/04 07:43:25] ppdet.engine INFO: Epoch: [0] [ 100/7329] learning_rate: 0.000005 loss_class: 0.944255 loss_bbox: 1.379791 loss_giou: 1.742733 loss_class_aux: 1.795154 loss_bbox_aux: 8.391826 loss_giou_aux: 10.500675 loss_class_dn: 0.946447 loss_bbox_dn: 0.770575 loss_giou_dn: 1.316061 loss_class_aux_dn: 3.990767 loss_bbox_aux_dn: 3.848480 loss_giou_aux_dn: 6.583784 loss: 41.833054 eta: 2 days, 16:23:24 batch_cost: 0.4039 data_cost: 0.0004 ips: 19.8064 images/s
[07/04 07:43:47] ppdet.engine INFO: Epoch: [0] [ 150/7329] learning_rate: 0.000008 loss_class: 0.865210 loss_bbox: 1.466716 loss_giou: 1.591932 loss_class_aux: 1.756492 loss_bbox_aux: 8.951975 loss_giou_aux: 9.711206 loss_class_dn: 0.842548 loss_bbox_dn: 0.834488 loss_giou_dn: 1.259095 loss_class_aux_dn: 3.579963 loss_bbox_aux_dn: 4.175209 loss_giou_aux_dn: 6.308812 loss: 42.384117 eta: 2 days, 14:14:23 batch_cost: 0.3952 data_cost: 0.0004 ips: 20.2440 images/s
[07/04 07:44:10] ppdet.engine INFO: Epoch: [0] [ 200/7329] learning_rate: 0.000010 loss_class: 0.821579 loss_bbox: 1.468903 loss_giou: 1.514127 loss_class_aux: 1.873052 loss_bbox_aux: 9.161399 loss_giou_aux: 9.294691 loss_class_dn: 0.771384 loss_bbox_dn: 0.882304 loss_giou_dn: 1.211941 loss_class_aux_dn: 3.285707 loss_bbox_aux_dn: 4.335419 loss_giou_aux_dn: 6.048694 loss: 40.782269 eta: 2 days, 13:46:32 batch_cost: 0.4122 data_cost: 0.0004 ips: 19.4101 images/s
[07/04 07:44:33] ppdet.engine INFO: Epoch: [0] [ 250/7329] learning_rate: 0.000013 loss_class: 0.827880 loss_bbox: 1.381951 loss_giou: 1.455075 loss_class_aux: 2.192601 loss_bbox_aux: 8.752991 loss_giou_aux: 9.094400 loss_class_dn: 0.770761 loss_bbox_dn: 0.938643 loss_giou_dn: 1.203800 loss_class_aux_dn: 3.360187 loss_bbox_aux_dn: 4.527280 loss_giou_aux_dn: 6.029949 loss: 41.134167 eta: 2 days, 13:39:11 batch_cost: 0.4176 data_cost: 0.0004 ips: 19.1563 images/s
[07/04 07:44:54] ppdet.engine INFO: Epoch: [0] [ 300/7329] learning_rate: 0.000015 loss_class: 0.878376 loss_bbox: 1.304542 loss_giou: 1.526390 loss_class_aux: 2.675135 loss_bbox_aux: 8.318256 loss_giou_aux: 9.266665 loss_class_dn: 0.797022 loss_bbox_dn: 0.914100 loss_giou_dn: 1.320687 loss_class_aux_dn: 3.393513 loss_bbox_aux_dn: 4.405071 loss_giou_aux_dn: 6.539315 loss: 43.072666 eta: 2 days, 12:48:05 batch_cost: 0.3861 data_cost: 0.0004 ips: 20.7226 images/s
[07/04 07:45:16] ppdet.engine INFO: Epoch: [0] [ 350/7329] learning_rate: 0.000018 loss_class: 0.894463 loss_bbox: 1.256612 loss_giou: 1.769405 loss_class_aux: 2.898498 loss_bbox_aux: 8.179535 loss_giou_aux: 10.855463 loss_class_dn: 0.813604 loss_bbox_dn: 0.898470 loss_giou_dn: 1.532507 loss_class_aux_dn: 3.429995 loss_bbox_aux_dn: 4.291843 loss_giou_aux_dn: 7.670493 loss: 44.766773 eta: 2 days, 12:33:08 batch_cost: 0.4034 data_cost: 0.0004 ips: 19.8330 images/s
[07/04 07:45:38] ppdet.engine INFO: Epoch: [0] [ 400/7329] learning_rate: 0.000020 loss_class: 0.790304 loss_bbox: 1.028785 loss_giou: 1.359278 loss_class_aux: 3.110061 loss_bbox_aux: 6.837068 loss_giou_aux: 8.406872 loss_class_dn: 0.694199 loss_bbox_dn: 0.847215 loss_giou_dn: 1.303349 loss_class_aux_dn: 3.033953 loss_bbox_aux_dn: 4.142806 loss_giou_aux_dn: 6.541418 loss: 38.661110 eta: 2 days, 12:10:25 batch_cost: 0.3930 data_cost: 0.0004 ips: 20.3581 images/s
[07/04 07:45:59] ppdet.engine INFO: Epoch: [0] [ 450/7329] learning_rate: 0.000023 loss_class: 0.831289 loss_bbox: 0.900885 loss_giou: 1.274977 loss_class_aux: 3.937345 loss_bbox_aux: 6.190751 loss_giou_aux: 8.041082 loss_class_dn: 0.692314 loss_bbox_dn: 0.861287 loss_giou_dn: 1.369993 loss_class_aux_dn: 3.098099 loss_bbox_aux_dn: 4.358569 loss_giou_aux_dn: 6.602709 loss: 38.432137 eta: 2 days, 11:41:49 batch_cost: 0.3818 data_cost: 0.0004 ips: 20.9528 images/s
[07/04 07:46:20] ppdet.engine INFO: Epoch: [0] [ 500/7329] learning_rate: 0.000025 loss_class: 0.768482 loss_bbox: 0.760391 loss_giou: 1.233569 loss_class_aux: 3.949719 loss_bbox_aux: 4.985241 loss_giou_aux: 7.611237 loss_class_dn: 0.649078 loss_bbox_dn: 0.782088 loss_giou_dn: 1.294421 loss_class_aux_dn: 2.896272 loss_bbox_aux_dn: 3.969111 loss_giou_aux_dn: 6.413960 loss: 36.220665 eta: 2 days, 11:22:51 batch_cost: 0.3864 data_cost: 0.0004 ips: 20.7046 images/s
[07/04 07:46:42] ppdet.engine INFO: Epoch: [0] [ 550/7329] learning_rate: 0.000028 loss_class: 0.728091 loss_bbox: 0.776833 loss_giou: 1.123835 loss_class_aux: 3.761339 loss_bbox_aux: 5.018996 loss_giou_aux: 7.186420 loss_class_dn: 0.616948 loss_bbox_dn: 0.788702 loss_giou_dn: 1.239618 loss_class_aux_dn: 2.915491 loss_bbox_aux_dn: 3.992010 loss_giou_aux_dn: 6.192928 loss: 36.567657 eta: 2 days, 11:14:37 batch_cost: 0.3956 data_cost: 0.0004 ips: 20.2233 images/s
[07/04 07:47:04] ppdet.engine INFO: Epoch: [0] [ 600/7329] learning_rate: 0.000030 loss_class: 0.763762 loss_bbox: 0.656370 loss_giou: 1.219659 loss_class_aux: 4.180519 loss_bbox_aux: 4.357739 loss_giou_aux: 7.853674 loss_class_dn: 0.616509 loss_bbox_dn: 0.794917 loss_giou_dn: 1.358421 loss_class_aux_dn: 2.903438 loss_bbox_aux_dn: 4.027238 loss_giou_aux_dn: 6.720316 loss: 35.873466 eta: 2 days, 11:02:11 batch_cost: 0.3881 data_cost: 0.0004 ips: 20.6158 images/s
[07/04 07:47:25] ppdet.engine INFO: Epoch: [0] [ 650/7329] learning_rate: 0.000033 loss_class: 0.737461 loss_bbox: 0.583091 loss_giou: 1.170799 loss_class_aux: 4.762381 loss_bbox_aux: 3.967538 loss_giou_aux: 7.494648 loss_class_dn: 0.605938 loss_bbox_dn: 0.730950 loss_giou_dn: 1.339651 loss_class_aux_dn: 2.837840 loss_bbox_aux_dn: 3.697802 loss_giou_aux_dn: 6.588319 loss: 35.522499 eta: 2 days, 10:55:36 batch_cost: 0.3939 data_cost: 0.0004 ips: 20.3075 images/s
[07/04 07:47:48] ppdet.engine INFO: Epoch: [0] [ 700/7329] learning_rate: 0.000035 loss_class: 0.731348 loss_bbox: 0.576231 loss_giou: 1.091594 loss_class_aux: 5.038460 loss_bbox_aux: 3.846195 loss_giou_aux: 7.036030 loss_class_dn: 0.584472 loss_bbox_dn: 0.741416 loss_giou_dn: 1.262873 loss_class_aux_dn: 2.908104 loss_bbox_aux_dn: 3.811942 loss_giou_aux_dn: 6.383394 loss: 35.792530 eta: 2 days, 10:59:28 batch_cost: 0.4092 data_cost: 0.0004 ips: 19.5496 images/s
[07/04 07:48:11] ppdet.engine INFO: Epoch: [0] [ 750/7329] learning_rate: 0.000038 loss_class: 0.719343 loss_bbox: 0.521474 loss_giou: 0.999734 loss_class_aux: 5.578853 loss_bbox_aux: 3.550250 loss_giou_aux: 6.466695 loss_class_dn: 0.550823 loss_bbox_dn: 0.694916 loss_giou_dn: 1.221618 loss_class_aux_dn: 2.914670 loss_bbox_aux_dn: 3.519604 loss_giou_aux_dn: 6.196606 loss: 33.507900 eta: 2 days, 11:06:23 batch_cost: 0.4154 data_cost: 0.0004 ips: 19.2591 images/s
[07/04 07:48:33] ppdet.engine INFO: Epoch: [0] [ 800/7329] learning_rate: 0.000040 loss_class: 0.676013 loss_bbox: 0.478862 loss_giou: 1.018566 loss_class_aux: 4.934841 loss_bbox_aux: 3.229126 loss_giou_aux: 6.532661 loss_class_dn: 0.514488 loss_bbox_dn: 0.692227 loss_giou_dn: 1.252319 loss_class_aux_dn: 2.825400 loss_bbox_aux_dn: 3.580340 loss_giou_aux_dn: 6.327948 loss: 31.801727 eta: 2 days, 11:08:38 batch_cost: 0.4085 data_cost: 0.0004 ips: 19.5827 images/s
The ppdet log above has an issue: https://github.com/PaddlePaddle/PaddleDetection/pull/8409. I used the latest commit, which has a bug in setting iou_score. After fixing it, I got the logs below from ppdet.
# ppdet 8bs x 2GPU
[07/04 15:02:50] ppdet.engine INFO: Epoch: [0] [ 0/7329] learning_rate: 0.000000 loss_class: 0.201641 loss_bbox: 1.727503 loss_giou: 1.307491 loss_class_aux: 1.212130 loss_bbox_aux: 10.493377 loss_giou_aux: 7.949263 loss_class_dn: 0.635403 loss_bbox_dn: 0.939284 loss_giou_dn: 1.015846 loss_class_aux_dn: 3.224553 loss_bbox_aux_dn: 4.696418 loss_giou_aux_dn: 5.079230 loss: 38.482140 eta: 14 days, 23:48:57 batch_cost: 2.4547 data_cost: 0.0005 ips: 3.2590 images/s
[07/04 15:03:13] ppdet.engine INFO: Epoch: [0] [ 50/7329] learning_rate: 0.000003 loss_class: 0.297633 loss_bbox: 1.479268 loss_giou: 1.815598 loss_class_aux: 1.675295 loss_bbox_aux: 9.019381 loss_giou_aux: 10.936712 loss_class_dn: 0.912077 loss_bbox_dn: 0.838664 loss_giou_dn: 1.344816 loss_class_aux_dn: 4.453175 loss_bbox_aux_dn: 4.192844 loss_giou_aux_dn: 6.724151 loss: 46.462036 eta: 2 days, 17:45:22 batch_cost: 0.4085 data_cost: 0.0005 ips: 19.5827 images/s
[07/04 15:03:35] ppdet.engine INFO: Epoch: [0] [ 100/7329] learning_rate: 0.000005 loss_class: 0.285327 loss_bbox: 1.477210 loss_giou: 1.603661 loss_class_aux: 1.596411 loss_bbox_aux: 8.960684 loss_giou_aux: 9.666679 loss_class_dn: 0.779881 loss_bbox_dn: 0.856219 loss_giou_dn: 1.201171 loss_class_aux_dn: 3.744936 loss_bbox_aux_dn: 4.275051 loss_giou_aux_dn: 6.006138 loss: 42.018841 eta: 2 days, 14:01:19 batch_cost: 0.3973 data_cost: 0.0005 ips: 20.1378 images/s
[07/04 15:03:57] ppdet.engine INFO: Epoch: [0] [ 150/7329] learning_rate: 0.000008 loss_class: 0.275387 loss_bbox: 1.372201 loss_giou: 1.629953 loss_class_aux: 1.592410 loss_bbox_aux: 8.422577 loss_giou_aux: 9.954062 loss_class_dn: 0.749548 loss_bbox_dn: 0.786661 loss_giou_dn: 1.268025 loss_class_aux_dn: 3.732531 loss_bbox_aux_dn: 3.917260 loss_giou_aux_dn: 6.346529 loss: 41.343445 eta: 2 days, 12:48:22 batch_cost: 0.3983 data_cost: 0.0005 ips: 20.0870 images/s
[07/04 15:04:19] ppdet.engine INFO: Epoch: [0] [ 200/7329] learning_rate: 0.000010 loss_class: 0.349416 loss_bbox: 1.482662 loss_giou: 1.670851 loss_class_aux: 1.887396 loss_bbox_aux: 9.177547 loss_giou_aux: 10.195324 loss_class_dn: 0.698555 loss_bbox_dn: 0.891267 loss_giou_dn: 1.347793 loss_class_aux_dn: 3.470482 loss_bbox_aux_dn: 4.455425 loss_giou_aux_dn: 6.663803 loss: 41.833828 eta: 2 days, 12:38:50 batch_cost: 0.4108 data_cost: 0.0006 ips: 19.4765 images/s
[07/04 15:04:41] ppdet.engine INFO: Epoch: [0] [ 250/7329] learning_rate: 0.000013 loss_class: 0.381666 loss_bbox: 1.361250 loss_giou: 1.545455 loss_class_aux: 2.009711 loss_bbox_aux: 8.443027 loss_giou_aux: 9.609318 loss_class_dn: 0.675151 loss_bbox_dn: 0.908471 loss_giou_dn: 1.278737 loss_class_aux_dn: 3.423970 loss_bbox_aux_dn: 4.396884 loss_giou_aux_dn: 6.381564 loss: 41.208412 eta: 2 days, 12:07:38 batch_cost: 0.3963 data_cost: 0.0005 ips: 20.1872 images/s
[07/04 15:05:03] ppdet.engine INFO: Epoch: [0] [ 300/7329] learning_rate: 0.000015 loss_class: 0.455796 loss_bbox: 1.304743 loss_giou: 1.480081 loss_class_aux: 2.504373 loss_bbox_aux: 8.138662 loss_giou_aux: 9.187222 loss_class_dn: 0.657482 loss_bbox_dn: 0.949222 loss_giou_dn: 1.317122 loss_class_aux_dn: 3.362846 loss_bbox_aux_dn: 4.543368 loss_giou_aux_dn: 6.522356 loss: 40.966015 eta: 2 days, 11:42:11 batch_cost: 0.3932 data_cost: 0.0005 ips: 20.3464 images/s
[07/04 15:05:26] ppdet.engine INFO: Epoch: [0] [ 350/7329] learning_rate: 0.000018 loss_class: 0.530600 loss_bbox: 1.225484 loss_giou: 1.654275 loss_class_aux: 2.943487 loss_bbox_aux: 7.658451 loss_giou_aux: 10.139181 loss_class_dn: 0.690101 loss_bbox_dn: 0.894718 loss_giou_dn: 1.529423 loss_class_aux_dn: 3.481877 loss_bbox_aux_dn: 4.247900 loss_giou_aux_dn: 7.611682 loss: 44.020210 eta: 2 days, 11:53:08 batch_cost: 0.4166 data_cost: 0.0006 ips: 19.2046 images/s
[07/04 15:05:48] ppdet.engine INFO: Epoch: [0] [ 400/7329] learning_rate: 0.000020 loss_class: 0.546594 loss_bbox: 1.050263 loss_giou: 1.361742 loss_class_aux: 3.038521 loss_bbox_aux: 6.921156 loss_giou_aux: 8.373913 loss_class_dn: 0.622643 loss_bbox_dn: 0.901651 loss_giou_dn: 1.274494 loss_class_aux_dn: 3.126308 loss_bbox_aux_dn: 4.433243 loss_giou_aux_dn: 6.330893 loss: 39.587692 eta: 2 days, 11:49:14 batch_cost: 0.4056 data_cost: 0.0005 ips: 19.7250 images/s
[07/04 15:06:11] ppdet.engine INFO: Epoch: [0] [ 450/7329] learning_rate: 0.000023 loss_class: 0.600202 loss_bbox: 0.908755 loss_giou: 1.338081 loss_class_aux: 3.383973 loss_bbox_aux: 6.074856 loss_giou_aux: 8.237648 loss_class_dn: 0.610507 loss_bbox_dn: 0.791783 loss_giou_dn: 1.329810 loss_class_aux_dn: 3.103679 loss_bbox_aux_dn: 4.026302 loss_giou_aux_dn: 6.615985 loss: 39.349503 eta: 2 days, 11:42:53 batch_cost: 0.4023 data_cost: 0.0005 ips: 19.8881 images/s
[07/04 15:06:34] ppdet.engine INFO: Epoch: [0] [ 500/7329] learning_rate: 0.000025 loss_class: 0.679902 loss_bbox: 0.800746 loss_giou: 1.200678 loss_class_aux: 3.808671 loss_bbox_aux: 5.183709 loss_giou_aux: 7.519331 loss_class_dn: 0.568733 loss_bbox_dn: 0.755130 loss_giou_dn: 1.312346 loss_class_aux_dn: 2.971537 loss_bbox_aux_dn: 3.829945 loss_giou_aux_dn: 6.568553 loss: 37.149818 eta: 2 days, 11:50:23 batch_cost: 0.4167 data_cost: 0.0005 ips: 19.2000 images/s
[07/04 15:06:56] ppdet.engine INFO: Epoch: [0] [ 550/7329] learning_rate: 0.000028 loss_class: 0.751524 loss_bbox: 0.671623 loss_giou: 1.125915 loss_class_aux: 4.166484 loss_bbox_aux: 4.458362 loss_giou_aux: 6.992559 loss_class_dn: 0.560734 loss_bbox_dn: 0.712450 loss_giou_dn: 1.272195 loss_class_aux_dn: 2.772212 loss_bbox_aux_dn: 3.673186 loss_giou_aux_dn: 6.362156 loss: 34.372982 eta: 2 days, 11:49:32 batch_cost: 0.4080 data_cost: 0.0005 ips: 19.6085 images/s
[07/04 15:07:18] ppdet.engine INFO: Epoch: [0] [ 600/7329] learning_rate: 0.000030 loss_class: 0.812774 loss_bbox: 0.652046 loss_giou: 1.190217 loss_class_aux: 4.567040 loss_bbox_aux: 4.310749 loss_giou_aux: 7.383471 loss_class_dn: 0.585980 loss_bbox_dn: 0.812320 loss_giou_dn: 1.316965 loss_class_aux_dn: 2.868565 loss_bbox_aux_dn: 4.059213 loss_giou_aux_dn: 6.592301 loss: 36.259888 eta: 2 days, 11:45:53 batch_cost: 0.4040 data_cost: 0.0005 ips: 19.8000 images/s
[07/04 15:07:40] ppdet.engine INFO: Epoch: [0] [ 650/7329] learning_rate: 0.000033 loss_class: 0.814888 loss_bbox: 0.652022 loss_giou: 1.263505 loss_class_aux: 4.560444 loss_bbox_aux: 4.430242 loss_giou_aux: 8.005415 loss_class_dn: 0.604041 loss_bbox_dn: 0.820365 loss_giou_dn: 1.305081 loss_class_aux_dn: 2.902153 loss_bbox_aux_dn: 4.106123 loss_giou_aux_dn: 6.590990 loss: 35.503151 eta: 2 days, 11:30:47 batch_cost: 0.3863 data_cost: 0.0005 ips: 20.7074 images/s
[07/04 15:08:03] ppdet.engine INFO: Epoch: [0] [ 700/7329] learning_rate: 0.000035 loss_class: 0.857192 loss_bbox: 0.552369 loss_giou: 1.086274 loss_class_aux: 4.628265 loss_bbox_aux: 3.728952 loss_giou_aux: 7.016881 loss_class_dn: 0.619368 loss_bbox_dn: 0.738904 loss_giou_dn: 1.247447 loss_class_aux_dn: 2.904344 loss_bbox_aux_dn: 3.695195 loss_giou_aux_dn: 6.333776 loss: 34.237926 eta: 2 days, 11:34:39 batch_cost: 0.4132 data_cost: 0.0005 ips: 19.3599 images/s
[07/04 15:08:25] ppdet.engine INFO: Epoch: [0] [ 750/7329] learning_rate: 0.000038 loss_class: 0.908259 loss_bbox: 0.549284 loss_giou: 1.055723 loss_class_aux: 5.006126 loss_bbox_aux: 3.712269 loss_giou_aux: 6.599482 loss_class_dn: 0.584488 loss_bbox_dn: 0.741655 loss_giou_dn: 1.197559 loss_class_aux_dn: 2.794070 loss_bbox_aux_dn: 3.833305 loss_giou_aux_dn: 6.021413 loss: 33.135147 eta: 2 days, 11:34:25 batch_cost: 0.4072 data_cost: 0.0005 ips: 19.6480 images/s
[07/04 15:08:47] ppdet.engine INFO: Epoch: [0] [ 800/7329] learning_rate: 0.000040 loss_class: 0.867941 loss_bbox: 0.485300 loss_giou: 1.014525 loss_class_aux: 4.928567 loss_bbox_aux: 3.233761 loss_giou_aux: 6.424962 loss_class_dn: 0.546647 loss_bbox_dn: 0.710170 loss_giou_dn: 1.178975 loss_class_aux_dn: 2.668061 loss_bbox_aux_dn: 3.592177 loss_giou_aux_dn: 5.991127 loss: 30.773624 eta: 2 days, 11:30:30 batch_cost: 0.4005 data_cost: 0.0004 ips: 19.9771 images/s
@nijkah So does this mean there is an issue with the official code? Or is it that the official code trains fine, but there are issues reproducing it in mmdetection? Converting from Paddle to PyTorch is difficult, so if it's too challenging, perhaps we can wait for the official release of the PyTorch code, or try training rtdetr in yolov8 to see if we can reproduce it there.
@hhaAndroid Sorry for the confusion. A modification to support SSOD in RT-DETR, made by someone who is not an original author of RT-DETR, introduced unexpected logic (skipping the IoU-aware classification loss). I am now trying to reproduce RT-DETR training in ppdet on the author's commit. I'll also try reproducing it in yolov8.
@hhaAndroid
I found an issue reporting that RT-DETR with R50 has a reproduction problem in PaddleDetection:
https://github.com/PaddlePaddle/PaddleDetection/issues/8381#issuecomment-1617769149
https://github.com/PaddlePaddle/PaddleDetection/issues/8402
I also consistently encounter the same issue when training R50 with PaddleDetection.
I will try migrating R34 or R18 instead.
Current Status
- Migrate R18VD, R34VD
- Reproduce RT-DETR with r18vd in PPDet. RT-DETR w/ R18VD in MMDet clearly shows slower convergence compared to the one in PPDet.
# ppdet - first validation
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.111
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.180
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.114
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.075
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.135
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.141
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.196
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.366
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.461
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.233
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.493
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.652
[07/05 06:32:14] ppdet.engine INFO: Total sample number: 5000, average FPS: 101.27582890346476
[07/05 06:32:14] ppdet.engine INFO: Best test bbox ap is 0.111.
# mmdet - first validation
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.040
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.079
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.035
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.020
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.054
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.063
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.275
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.275
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.275
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.079
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.260
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.466
07/05 20:46:09 - mmengine - INFO - bbox_mAP_copypaste: 0.040 0.079 0.035 0.020 0.054 0.063
07/05 20:46:11 - mmengine - INFO - Epoch(val) [1][5000/5000] coco/bbox_mAP: 0.0400 coco/bbox_mAP_50: 0.0790 coco/bbox_mAP_75: 0.0350 coco/bbox_mAP_s: 0.0200 coco/bbox_mAP_m: 0.0540 coco/bbox_mAP_l: 0.0630 data_time: 0.0018 time: 0.0521
- I compared the model modules between MMDet and PPDet in training mode:
  - Initialized the R18VD model with the pretrained weights
  - Fed demo batch samples and compared features and losses directly

  I found two differences that make the losses differ:
- The VFL-like loss in RT-DETR uses alpha=0.75 while DETR's uses alpha=0.25, but this doesn't affect the performance in MMDet. (The matching metrics are not different between RT-DETR and DINO.)
- The average_factor for the classification loss is different. I am investigating its effect on convergence.

The remaining things to investigate are the optimization logic, the augmentation strategies, and other training logic.
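For reference, the alpha asymmetry mentioned above can be illustrated with a minimal varifocal-style weighting sketch. The helper name `vfl_weight` is illustrative, not the actual MMDet/PPDet implementation, and the real losses also apply an `avg_factor` normalization that is omitted here:

```python
import torch
import torch.nn.functional as F

def vfl_weight(pred_logits, gt_score, alpha=0.75, gamma=2.0):
    """Varifocal-style per-element weight: positives are weighted by their
    IoU-aware target score, negatives by alpha * sigmoid(p)^gamma."""
    pos_mask = (gt_score > 0).float()
    neg_w = alpha * pred_logits.sigmoid().pow(gamma) * (1 - pos_mask)
    pos_w = gt_score * pos_mask
    return neg_w + pos_w

pred = torch.zeros(4)                         # raw logits (sigmoid = 0.5)
target = torch.tensor([0.9, 0.0, 0.0, 0.5])   # IoU-aware targets, 0 = negative
w = vfl_weight(pred, target)                  # positives keep 0.9 / 0.5, negatives get 0.75 * 0.25
loss = F.binary_cross_entropy_with_logits(pred, target, reduction='none') * w
```

With alpha=0.25 (the DETR-style setting) the negative weights shrink by a factor of three, which is exactly the asymmetry being compared above.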
@nijkah
I reran rtdetr_r50vd_coco using this pure rtdetr code, https://github.com/lyuwenyu/RT-DETR, and the first-epoch result is normal.
@lyuwenyu Thank you. I didn't notice that the repository provides training code with ppdet. I also checked the first validation results.
The only thing left to figure out is why the migrated model shows slow convergence. 🤔
@nijkah
I think you should pay attention to the default initialization of the model's parameters, the pretrained backbone weights, and the lr_multi of the backbone.
For initialization, you can initialize the paddle rtdetr model and save its states, then load them into the pytorch rtdetr model, and finally check both convergence curves.
For the backbone, I provide pytorch resnet pretrained weights converted from paddle, and have uploaded the code and weights in this repo: https://github.com/lyuwenyu/RT-DETR/tree/main/rtdetr_pytorch
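The comparison step suggested above can be sketched as below, assuming the state-dict keys have already been aligned between the two frameworks (`diff_state_dicts` is a hypothetical helper, not part of either repo):

```python
import torch

def diff_state_dicts(sd_a, sd_b, atol=1e-6):
    """Return keys that are missing, shape-mismatched, or numerically
    different between two state dicts."""
    mismatched = []
    for k in sd_a:
        if k not in sd_b or sd_a[k].shape != sd_b[k].shape:
            mismatched.append(k)
        elif not torch.allclose(sd_a[k].float(), sd_b[k].float(), atol=atol):
            mismatched.append(k)
    return mismatched

# Toy tensors standing in for paddle-converted vs. torch-initialized weights.
sd_ref = {'backbone.conv1.weight': torch.ones(2, 2)}
sd_new = {'backbone.conv1.weight': torch.ones(2, 2) * 1.5}
bad = diff_state_dicts(sd_ref, sd_new)
```

Running this over a paddle-converted checkpoint and a freshly initialized torch model pinpoints exactly which modules diverge at step zero, before any convergence-curve comparison.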
@lyuwenyu Thank you!
@nijkah I can help you debug. Can you push your weight conversion script? I'll merge into a new temporary branch, then we can collaborate to debug and finalize this PR together.
@hhaAndroid This is the model weight conversion script: https://gist.github.com/nijkah/5faf6ae356188690f353e3585d9bfc19 I put it in a gist since the code is quite dirty 🤣 I tested it with 'rtdetr_r18vd', 'rtdetr_r34vd', and 'rtdetr_r50vd'. All converted models give the same performance reported in PPDet.
You can use pre-converted weights for MMDet here: https://github.com/nijkah/storage/releases
r18vd from ppdet
- 1 epoch training validation AP: 0.065 ~ 0.113
- 2 epoch training validation AP: 0.152 ~ 0.195
@nijkah I found several points worth noting:
1. The initializations in ppdet and mmdet seem somewhat different, which may be harder to troubleshoot.
2. ppdet uses syncbn, but mmdet does not.
3. ppdet does not apply weight decay to norm layers, but mmdet does.
4. There are still some gaps between the data augmentation in mmdet and ppdet, but they should not affect performance much.

Then I looked at the yolov8 reproduction, and I found that its data augmentation part directly uses yolov8's and does not match ppdet.
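The syncbn and norm-decay gaps can be sketched in plain PyTorch: `convert_sync_batchnorm` swaps in SyncBN layers, and excluding 1-D parameters from weight decay mimics ppdet's no-norm-decay behavior. This is a sketch on a toy module, not MMDet's actual `optim_wrapper` configuration:

```python
import torch
import torch.nn as nn

# Toy module standing in for an RT-DETR backbone stage.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# SyncBN gap: convert every BatchNorm layer to SyncBatchNorm.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Norm-decay gap: 1-D params (BN weights/biases, conv biases) get no decay.
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if p.ndim == 1 else decay).append(p)
optimizer = torch.optim.AdamW(
    [{'params': decay, 'weight_decay': 1e-4},
     {'params': no_decay, 'weight_decay': 0.0}], lr=1e-4)
```

In MMDet the same effect is usually achieved declaratively (e.g. via paramwise options in the optimizer config) rather than by hand-building parameter groups.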
@nijkah I have verified that the loss part is not a problem.
@nijkah
mmdet/models/task_modules/assigners/hungarian_assigner.py:131 in assign

  128             raise ImportError('Please run "pip install scipy" '
  129                               'to install scipy first.')
  130
❱ 131         matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
  132         matched_row_inds = torch.from_numpy(matched_row_inds).to(device)
  133         matched_col_inds = torch.from_numpy(matched_col_inds).to(device)

ValueError: matrix contains invalid numeric entries
I trained with this code, but this error occurred during training. Have you encountered it before?
@hhaAndroid
I didn't encounter this with the current config.
This happens when the cost matrix contains NaN or inf values. I encountered it before in https://github.com/open-mmlab/mmdetection/pull/10335, when overflow arose in FP16 settings.
Which config did you use for training?
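As a defensive sketch (not the actual MMDet fix), non-finite entries can be clamped before calling the solver; `safe_assign` is a hypothetical helper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def safe_assign(cost, big=1e8):
    """Clamp NaN/inf cost entries to large finite values so the
    Hungarian solver cannot raise on invalid numeric entries."""
    cost = np.nan_to_num(cost, nan=big, posinf=big, neginf=-big)
    return linear_sum_assignment(cost)

# A cost matrix poisoned the way an FP16 overflow would poison it.
cost = np.array([[0.1, np.nan], [np.inf, 0.2]])
rows, cols = safe_assign(cost)  # matches (0, 0) and (1, 1)
```

Clamping masks the symptom rather than the cause; the real fix is usually to stop the cost from overflowing (e.g. computing the cost in FP32), which is what the linked PR addresses.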
@nijkah ./tools/dist_train.sh configs/rtdetr/rtdetr_r50vd_8xb2-72e_coco.py 4
@hhaAndroid I ran the config, and it seems to work normally.
System environment:
sys.platform: linux
Python: 3.8.5 (default, Apr 6 2022, 10:47:05) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 1890175665
GPU 0,1,2,3: NVIDIA A40
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.9.1+cu111
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.0.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.10.1+cu111
OpenCV: 4.7.0
MMEngine: 0.8.2
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 1890175665
Distributed launcher: pytorch
Distributed training: True
GPU number: 4
log
07/19 03:40:12 - mmengine - INFO - load model from: https://github.com/nijkah/storage/releases/download/v0.0.1/resnet50vd_ssld_v2_pretrained.pth
07/19 03:40:12 - mmengine - INFO - Loads checkpoint by http backend from path: https://github.com/nijkah/storage/releases/download/v0.0.1/resnet50vd_ssld_v2_pretrained.pth
Downloading: "https://github.com/nijkah/storage/releases/download/v0.0.1/resnet50vd_ssld_v2_pretrained.pth" to /root/.cache/torch/hub/checkpoints/resnet50vd_ssld_v2_pretrained.pth
07/19 03:40:15 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
07/19 03:40:15 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
07/19 03:40:15 - mmengine - INFO - Checkpoints will be saved to /mmdetection/work_dirs/rtdetr_r50vd_8xb2-72e_coco.
07/19 03:40:37 - mmengine - INFO - Epoch(train) [1][ 50/7393] base_lr: 2.5488e-06 lr: 2.5488e-06 eta: 2 days, 16:52:23 time: 0.4388 data_time: 0.0348 memory: 7915 grad_norm: 23.1806 loss: 46.5788 loss_cls: 0.3036 loss_bbox: 1.5355 loss_iou: 1.9450 d0.loss_cls: 0.2826 d0.loss_bbox: 1.5581 d0.loss_iou: 1.9657 d1.loss_cls: 0.2895 d1.loss_bbox: 1.5552 d1.loss_iou: 1.9561 d2.loss_cls: 0.2871 d2.loss_bbox: 1.5529 d2.loss_iou: 1.9506 d3.loss_cls: 0.2879 d3.loss_bbox: 1.5454 d3.loss_iou: 1.9511 d4.loss_cls: 0.2937 d4.loss_bbox: 1.5391 d4.loss_iou: 1.9482 enc_loss_cls: 0.2891 enc_loss_bbox: 1.5686 enc_loss_iou: 1.9728 dn_loss_cls: 0.9296 dn_loss_bbox: 1.0348 dn_loss_iou: 1.3886 d0.dn_loss_cls: 0.9047 d0.dn_loss_bbox: 1.0348 d0.dn_loss_iou: 1.3886 d1.dn_loss_cls: 0.9145 d1.dn_loss_bbox: 1.0348 d1.dn_loss_iou: 1.3886 d2.dn_loss_cls: 0.9081 d2.dn_loss_bbox: 1.0348 d2.dn_loss_iou: 1.3886 d3.dn_loss_cls: 0.8958 d3.dn_loss_bbox: 1.0348 d3.dn_loss_iou: 1.3886 d4.dn_loss_cls: 0.9078 d4.dn_loss_bbox: 1.0348 d4.dn_loss_iou: 1.3886
07/19 03:40:57 - mmengine - INFO - Epoch(train) [1][ 100/7393] base_lr: 5.0475e-06 lr: 5.0475e-06 eta: 2 days, 14:36:48 time: 0.4083 data_time: 0.0110 memory: 7915 grad_norm: 25.3009 loss: 41.0585 loss_cls: 0.2421 loss_bbox: 1.3954 loss_iou: 1.6754 d0.loss_cls: 0.2355 d0.loss_bbox: 1.4214 d0.loss_iou: 1.7086 d1.loss_cls: 0.2378 d1.loss_bbox: 1.4147 d1.loss_iou: 1.6978 d2.loss_cls: 0.2326 d2.loss_bbox: 1.4053 d2.loss_iou: 1.6909 d3.loss_cls: 0.2304 d3.loss_bbox: 1.4080 d3.loss_iou: 1.6868 d4.loss_cls: 0.2340 d4.loss_bbox: 1.4020 d4.loss_iou: 1.6824 enc_loss_cls: 0.2415 enc_loss_bbox: 1.4307 enc_loss_iou: 1.7208 dn_loss_cls: 0.7357 dn_loss_bbox: 0.9159 dn_loss_iou: 1.2679 d0.dn_loss_cls: 0.8023 d0.dn_loss_bbox: 0.9159 d0.dn_loss_iou: 1.2682 d1.dn_loss_cls: 0.7928 d1.dn_loss_bbox: 0.9159 d1.dn_loss_iou: 1.2682 d2.dn_loss_cls: 0.7640 d2.dn_loss_bbox: 0.9159 d2.dn_loss_iou: 1.2681 d3.dn_loss_cls: 0.7348 d3.dn_loss_bbox: 0.9158 d3.dn_loss_iou: 1.2681 d4.dn_loss_cls: 0.7311 d4.dn_loss_bbox: 0.9158 d4.dn_loss_iou: 1.2680
07/19 03:41:17 - mmengine - INFO - Epoch(train) [1][ 150/7393] base_lr: 7.5463e-06 lr: 7.5463e-06 eta: 2 days, 13:46:04 time: 0.4065 data_time: 0.0101 memory: 7915 grad_norm: 27.5922 loss: 42.6928 loss_cls: 0.2918 loss_bbox: 1.3631 loss_iou: 1.7091 d0.loss_cls: 0.2804 d0.loss_bbox: 1.4013 d0.loss_iou: 1.7486 d1.loss_cls: 0.2807 d1.loss_bbox: 1.3993 d1.loss_iou: 1.7398 d2.loss_cls: 0.2832 d2.loss_bbox: 1.3796 d2.loss_iou: 1.7334 d3.loss_cls: 0.2887 d3.loss_bbox: 1.3747 d3.loss_iou: 1.7248 d4.loss_cls: 0.2902 d4.loss_bbox: 1.3647 d4.loss_iou: 1.7192 enc_loss_cls: 0.2917 enc_loss_bbox: 1.4160 enc_loss_iou: 1.7576 dn_loss_cls: 0.7288 dn_loss_bbox: 0.9966 dn_loss_iou: 1.3852 d0.dn_loss_cls: 0.8279 d0.dn_loss_bbox: 0.9946 d0.dn_loss_iou: 1.3868 d1.dn_loss_cls: 0.7903 d1.dn_loss_bbox: 0.9947 d1.dn_loss_iou: 1.3863 d2.dn_loss_cls: 0.7570 d2.dn_loss_bbox: 0.9949 d2.dn_loss_iou: 1.3859 d3.dn_loss_cls: 0.7320 d3.dn_loss_bbox: 0.9953 d3.dn_loss_iou: 1.3855 d4.dn_loss_cls: 0.7321 d4.dn_loss_bbox: 0.9959 d4.dn_loss_iou: 1.3853
07/19 03:41:38 - mmengine - INFO - Epoch(train) [1][ 200/7393] base_lr: 1.0045e-05 lr: 1.0045e-05 eta: 2 days, 13:28:37 time: 0.4101 data_time: 0.0099 memory: 7915 grad_norm: 32.9975 loss: 46.3005 loss_cls: 0.3720 loss_bbox: 1.7145 loss_iou: 2.1193 d0.loss_cls: 0.3407 d0.loss_bbox: 1.7942 d0.loss_iou: 2.1889 d1.loss_cls: 0.3420 d1.loss_bbox: 1.7773 d1.loss_iou: 2.1840 d2.loss_cls: 0.3524 d2.loss_bbox: 1.7522 d2.loss_iou: 2.1678 d3.loss_cls: 0.3517 d3.loss_bbox: 1.7473 d3.loss_iou: 2.1477 d4.loss_cls: 0.3606 d4.loss_bbox: 1.7302 d4.loss_iou: 2.1305 enc_loss_cls: 0.3596 enc_loss_bbox: 1.8170 enc_loss_iou: 2.2043 dn_loss_cls: 0.5826 dn_loss_bbox: 0.8497 dn_loss_iou: 1.2727 d0.dn_loss_cls: 0.6792 d0.dn_loss_bbox: 0.8332 d0.dn_loss_iou: 1.2714 d1.dn_loss_cls: 0.6333 d1.dn_loss_bbox: 0.8345 d1.dn_loss_iou: 1.2709 d2.dn_loss_cls: 0.6077 d2.dn_loss_bbox: 0.8369 d2.dn_loss_iou: 1.2704 d3.dn_loss_cls: 0.5922 d3.dn_loss_bbox: 0.8403 d3.dn_loss_iou: 1.2705 d4.dn_loss_cls: 0.5847 d4.dn_loss_bbox: 0.8450 d4.dn_loss_iou: 1.2715
07/19 03:41:58 - mmengine - INFO - Epoch(train) [1][ 250/7393] base_lr: 1.2544e-05 lr: 1.2544e-05 eta: 2 days, 13:20:56 time: 0.4118 data_time: 0.0092 memory: 7915 grad_norm: 37.4890 loss: 41.7088 loss_cls: 0.3467 loss_bbox: 1.3374 loss_iou: 1.5719 d0.loss_cls: 0.2950 d0.loss_bbox: 1.4405 d0.loss_iou: 1.6661 d1.loss_cls: 0.2994 d1.loss_bbox: 1.4163 d1.loss_iou: 1.6506 d2.loss_cls: 0.3101 d2.loss_bbox: 1.3913 d2.loss_iou: 1.6229 d3.loss_cls: 0.3238 d3.loss_bbox: 1.3727 d3.loss_iou: 1.6017 d4.loss_cls: 0.3382 d4.loss_bbox: 1.3523 d4.loss_iou: 1.5808 enc_loss_cls: 0.2944 enc_loss_bbox: 1.4647 enc_loss_iou: 1.6898 dn_loss_cls: 0.6230 dn_loss_bbox: 1.0615 dn_loss_iou: 1.3754 d0.dn_loss_cls: 0.7424 d0.dn_loss_bbox: 1.0067 d0.dn_loss_iou: 1.3609 d1.dn_loss_cls: 0.6853 d1.dn_loss_bbox: 1.0110 d1.dn_loss_iou: 1.3609 d2.dn_loss_cls: 0.6522 d2.dn_loss_bbox: 1.0196 d2.dn_loss_iou: 1.3627 d3.dn_loss_cls: 0.6385 d3.dn_loss_bbox: 1.0323 d3.dn_loss_iou: 1.3662 d4.dn_loss_cls: 0.6253 d4.dn_loss_bbox: 1.0475 d4.dn_loss_iou: 1.3707
07/19 03:42:19 - mmengine - INFO - Epoch(train) [1][ 300/7393] base_lr: 1.5043e-05 lr: 1.5043e-05 eta: 2 days, 13:09:39 time: 0.4077 data_time: 0.0096 memory: 7915 grad_norm: 57.1405 loss: 42.8358 loss_cls: 0.3730 loss_bbox: 1.4224 loss_iou: 1.7573 d0.loss_cls: 0.3162 d0.loss_bbox: 1.5481 d0.loss_iou: 1.8606 d1.loss_cls: 0.3289 d1.loss_bbox: 1.5076 d1.loss_iou: 1.8292 d2.loss_cls: 0.3403 d2.loss_bbox: 1.4711 d2.loss_iou: 1.8042 d3.loss_cls: 0.3521 d3.loss_bbox: 1.4512 d3.loss_iou: 1.7779 d4.loss_cls: 0.3665 d4.loss_bbox: 1.4345 d4.loss_iou: 1.7621 enc_loss_cls: 0.3167 enc_loss_bbox: 1.5864 enc_loss_iou: 1.8978 dn_loss_cls: 0.5547 dn_loss_bbox: 1.0063 dn_loss_iou: 1.3510 d0.dn_loss_cls: 0.6752 d0.dn_loss_bbox: 0.9197 d0.dn_loss_iou: 1.3151 d1.dn_loss_cls: 0.6211 d1.dn_loss_bbox: 0.9339 d1.dn_loss_iou: 1.3194 d2.dn_loss_cls: 0.5888 d2.dn_loss_bbox: 0.9535 d2.dn_loss_iou: 1.3270 d3.dn_loss_cls: 0.5662 d3.dn_loss_bbox: 0.9745 d3.dn_loss_iou: 1.3356 d4.dn_loss_cls: 0.5542 d4.dn_loss_bbox: 0.9922 d4.dn_loss_iou: 1.3436
07/19 03:42:39 - mmengine - INFO - Epoch(train) [1][ 350/7393] base_lr: 1.7541e-05 lr: 1.7541e-05 eta: 2 days, 13:08:10 time: 0.4130 data_time: 0.0118 memory: 7912 grad_norm: 53.6947 loss: 40.5328 loss_cls: 0.3633 loss_bbox: 1.2334 loss_iou: 1.6118 d0.loss_cls: 0.3450 d0.loss_bbox: 1.3106 d0.loss_iou: 1.6730 d1.loss_cls: 0.3480 d1.loss_bbox: 1.2791 d1.loss_iou: 1.6474 d2.loss_cls: 0.3529 d2.loss_bbox: 1.2626 d2.loss_iou: 1.6260 d3.loss_cls: 0.3558 d3.loss_bbox: 1.2460 d3.loss_iou: 1.6224 d4.loss_cls: 0.3619 d4.loss_bbox: 1.2372 d4.loss_iou: 1.6159 enc_loss_cls: 0.3291 enc_loss_bbox: 1.3630 enc_loss_iou: 1.7211 dn_loss_cls: 0.5397 dn_loss_bbox: 1.0101 dn_loss_iou: 1.3940 d0.dn_loss_cls: 0.6592 d0.dn_loss_bbox: 0.9379 d0.dn_loss_iou: 1.3688 d1.dn_loss_cls: 0.5947 d1.dn_loss_bbox: 0.9603 d1.dn_loss_iou: 1.3753 d2.dn_loss_cls: 0.5650 d2.dn_loss_bbox: 0.9797 d2.dn_loss_iou: 1.3821 d3.dn_loss_cls: 0.5461 d3.dn_loss_bbox: 0.9948 d3.dn_loss_iou: 1.3879 d4.dn_loss_cls: 0.5365 d4.dn_loss_bbox: 1.0040 d4.dn_loss_iou: 1.3914
mmcv==2.0.0, using the latest commit in this branch.
Can I ask you to provide more information about your environment?
@nijkah I think this is an intermittent issue; I am now trying to reproduce it. But I found that training with 4 x V100s is very slow, taking about 3.5 days. That feels too slow.
@hhaAndroid That is because RT-DETR's default training uses a 6x-epoch schedule, which is quite long compared to DINO's 1x-epoch training. After reproducing the full training once, you can debug convergence by comparing the first-validation performance with https://github.com/open-mmlab/mmdetection/pull/10498#issuecomment-1630177462.
@hhaAndroid It's true. I followed the original implementation. You can check the log and config file from the author's repository. https://github.com/lyuwenyu/RT-DETR/tree/main/rtdetr_paddle
@nijkah I found that even after changing the data augmentation, the results have not improved. Now I cannot determine which part is the problem. My suggestion: copy the implementation in yolov8, completely replace the model + loss parts, and keep the data augmentation the same as in mmdet; this would isolate whether it is really a data augmentation problem. Do you have time to try?
@nijkah This PR is quite troublesome. Feel free to push it forward in your free time without it interfering with your own work.
Hello guys, thanks for your hard work. I will also try to reproduce RT-DETR on mmdet3. I will write here if I get some important results.
@rydenisbak Thank you. I'm now facing some issues reproducing the results, and it would be greatly appreciated if you could help identify the problems.
I found an issue in the function generate_proposals; it should be
valid_wh = torch.tensor([W, H], dtype=torch.float32, device=device)
Otherwise it may incorrectly invalidate some areas, though this does not affect the case when W == H.
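A minimal sketch of why the (W, H) order matters: anchor centers are laid out as (x, y), so x must be divided by W and y by H; with the swapped order the normalized x exceeds 1 whenever W > H. The names below are illustrative, not the actual generate_proposals code:

```python
import torch

H, W = 32, 64  # a non-square feature map
grid_y, grid_x = torch.meshgrid(
    torch.arange(H), torch.arange(W), indexing='ij')
# Centers stored as (x, y) pairs, shape (H, W, 2).
centers = torch.stack([grid_x, grid_y], dim=-1).float() + 0.5

valid_wh = torch.tensor([W, H], dtype=torch.float32)    # correct: x/W, y/H
swapped_wh = torch.tensor([H, W], dtype=torch.float32)  # buggy order

ok = centers / valid_wh       # all coordinates stay inside [0, 1)
bad = centers / swapped_wh    # x-coordinates exceed 1 when W > H
```

Proposals whose normalized coordinates fall outside [0, 1) would be treated as out of the valid region, which is the incorrect invalidation described above; when W == H the two divisors coincide, so the bug is hidden on square inputs.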