mmdetection
[WIP]: Add DINO
Motivation
Reproduce the results of the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection" in mmdetection. The original code has been released.
Remark
This is my first time submitting a PR that reproduces an algorithm, and I still lack experience. I hope that developers who follow this PR can point out problems in my implementation and share results. I will also follow up with my own experimental results and progress plans.
Hi, thanks very much for your code. May I ask if you have met this problem?
When did this happen? At the beginning, or after a few epochs of training?
@czczup I haven't run a full training yet, but I have tested 1000 iters of training (2-GPU dist_train) and didn't meet it. Could you please provide more information? Thanks for your attention.
Thanks for your reply. I used a single GPU to train (for debugging) with the config you provided. It happened at the beginning of training (50 iters).
As you can see in this log:

It seems that bbox_head.label_embedding.weight did not receive a gradient during some iterations.
I could use find_unused_parameters=True to avoid this problem, but it would affect training efficiency.
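For anyone who wants the temporary workaround anyway: in MMDetection 2.x this flag can be set at the top level of the config (tools/train.py reads it when wrapping the model with MMDistributedDataParallel). It only hides the unused-parameter issue and adds a little overhead, so it is a stopgap rather than a fix:

# Top-level entry in the training config (MMDetection 2.x): lets DDP tolerate
# parameters that receive no gradient in some iterations,
# e.g. bbox_head.label_embedding.weight. Adds a slight training overhead.
find_unused_parameters = True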
@czczup Hi, the newest config file contains some other incompatibility problems:
norm_cfg=dict(type='FrozenBatchNorm2d'), # should use BN instead
conv_cfg=dict(type='Conv2d', bias=True), # should be deleted
I made modifications on my local workstation, which I haven't committed yet. Did you meet these problems?
@czczup I'm working on experiments to align the training results; there are still a few modifications which I have not committed.
You can comment out the FrozenBatchNorm2d and use BN instead. I modified the bn file of mmcv to register FrozenBatchNorm2d, but I plan to delete that change in the end. In the original DINO, the convs of the neck have bias, which is not recommended in mmdetection; to align inference accuracy, I had to modify the conv-building function in mmcv. You can comment out that line.
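For anyone who wants to try FrozenBatchNorm2d without editing mmcv itself, here is a minimal sketch of a registry-based alternative. It assumes mmcv 1.x and uses torchvision's FrozenBatchNorm2d as a stand-in, which may differ in detail from the uncommitted modification described above:

# Sketch: register torchvision's FrozenBatchNorm2d into mmcv's NORM_LAYERS
# registry so that norm_cfg=dict(type='FrozenBatchNorm2d') becomes buildable.
# Place this in project code that is imported before the model is built.
from mmcv.cnn import NORM_LAYERS
from torchvision.ops.misc import FrozenBatchNorm2d

if 'FrozenBatchNorm2d' not in NORM_LAYERS:
    NORM_LAYERS.register_module(
        name='FrozenBatchNorm2d', module=FrozenBatchNorm2d)

With something like this in place, the FrozenBatchNorm2d line in the config could stay as-is instead of being swapped for BN.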
@czczup The label_embedding is used in DINOHead.forward_train() and DnQueryGenerator.forward(). I haven't found the bug yet; it may take a few days. I'm pushing full speed ahead with the training, and more details will be committed. Thanks for your attention and discussion.
@Li-Qingyun
With the latest config, the label_embedding problem has disappeared. Thanks for helping.
I checked with the paper and the source code published by the authors, and it seems that most of the settings are aligned, except for the learning_rate (1e-4 in paper but 2e-4 in this config) and noise_scale (0.5&0.4 in paper but 0.5&1.0 in this config).
I'm not sure if there's any detail I missed. I plan to train DINO-R50 for 12 epochs with 8 GPUs today.
@czczup Hi, the original repo has an sh file to launch training, in which the box_noise_scale is set to 1.
The description is in the last paragraph of D.3 (Supplementary Materials) of the paper:
Multi-scale setting. For our 4-scale models, we extract features from stages 2, 3, and 4 of the backbone and add an extra feature by down-sampling the output of the stage 4. An additional feature map of the backbone stage 1 is used for our 5-scale models. For hyper-parameters, we set λ1 = 1.0 and λ2 = 2.0 and use 100 CDN pairs which contain 100 positive queries and 100 negative queries.
The learning rate was indeed misaligned, thank you very much!
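For followers of the thread, a sketch of the corrected learning-rate entry in MMDetection 2.x config style; the weight decay and the backbone lr multiplier are assumptions taken from the original DINO repo rather than the exact committed config:

optimizer = dict(
    type='AdamW',
    lr=1e-4,  # paper value; the earlier config used 2e-4
    weight_decay=1e-4,  # assumed, following the original DINO repo
    paramwise_cfg=dict(
        custom_keys={'backbone': dict(lr_mult=0.1)}))  # lower lr for the backbone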
I met a new problem; it happens in the middle of training, e.g. around 2000 iterations. I'm trying to fix it.
I noticed that it happens when dn_cls_scores is None, so the keys in loss_dict differ across GPUs.
AssertionError: loss log variables are different across GPUs!
rank 2 len(log_vars): 21 keys: interm_loss_cls,interm_loss_bbox,interm_loss_iou,loss_cls,loss_bbox,loss_iou,d0.loss_cls,d0.loss_bbox,d0.loss_iou,d1.loss_cls,d1.loss_bbox,d1.loss_iou,d2.loss_cls,d2.loss_bbox,d2.loss_iou,d3.loss_cls,d3.loss_bbox,d3.loss_iou,d4.loss_cls,d4.loss_bbox,d4.loss_iou
@czczup Thanks! Maybe it's the special case where the current sample has no target; I'll check.
@czczup Hi, I'm waiting in the queue of the lab's slurm cluster, so I can't experiment right away. Could you test setting filter_empty_gt=True?
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(filter_empty_gt=True, pipeline=train_pipeline),  # Modify here
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline))
@Li-Qingyun Only setting filter_empty_gt=True is not enough; I also set allow_negative_crop=False.
Images without any bbox ground truth will cause this problem.
My model is still training, and I can share my results with you when it's finished.
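To make these two switches concrete for other readers of the thread, a sketch of where they live in an MMDetection 2.x config; the RandomCrop arguments are illustrative and should follow whatever the actual DINO config uses:

train_pipeline = [
    # ... other transforms from the DINO config ...
    dict(
        type='RandomCrop',
        crop_type='absolute_range',
        crop_size=(384, 600),  # illustrative values; keep those of the real config
        allow_negative_crop=False),  # never keep a crop that loses every GT box
    # ... remaining transforms ...
]
data = dict(
    train=dict(filter_empty_gt=True, pipeline=train_pipeline))  # skip images without GT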
@czczup thank you very much for the information.
The model modules should be able to handle this kind of case. I will check and fix this. Thanks for sharing.
@Li-Qingyun
My final box AP is 48.4, which is 0.6 points lower than the official repo (49.0 box AP). I suspect that skipping negative samples may cause some performance degradation.
@czczup thanks for sharing!
In the latest commit, I have modified DINOHead.loss_dn_single() in mmdet/models/dense_heads/dino_head.py to return a zero tensor for each dn loss (extract_dn_outputs() has been modified too).
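For anyone hitting the same AssertionError, a minimal sketch (not the exact committed code; argument names are illustrative) of the idea behind the fix: when there are no denoising outputs for a batch, return zero-valued losses so every rank logs the same loss_dict keys:

import torch

def loss_dn_single(self, dn_cls_scores, dn_bbox_preds, gt_bboxes_list,
                   gt_labels_list, img_metas, dn_metas):
    """Sketch of the no-denoising-target fallback."""
    if dn_cls_scores is None:
        # No denoising queries in this batch: return plain zero tensors so
        # loss_dict keeps identical keys on every GPU (no grad graph needed).
        device = gt_labels_list[0].device if gt_labels_list else 'cpu'
        zero = torch.zeros(1, device=device)
        return zero, zero, zero  # dn loss_cls, dn loss_bbox, dn loss_iou
    # ... otherwise fall through to the normal CDN loss computation ...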

Finished work
In the past week, I have finished aligning the inference accuracy (with the converted checkpoint released by the authors; FrozenBatchNorm and convs with bias in the neck were used), and aligned the losses generated by the two codebases from the same samples (a gap of about 2e-4 in the total losses). The state_dict conversion scripts and other aligning utils have been uploaded to this repo. The aforementioned work was all conducted on DINO_4scale.
Experimental modifications which are not committed
/mmcv/cnn/bricks/norm.py
For setting norm_cfg=dict(type='FrozenBatchNorm2d')

/mmcv/cnn/bricks/conv.py
For setting conv_cfg=dict(type='Conv2d', bias=True)

/mmdet/dataset/coco.py
The original DINO repo sets num_class=91. To align the inference mAP without training, CocoDataset is modified to produce aligned target labels. I have uploaded the coco.py file to this repo.
Overwrite coco.py, and add --options data.train.continuous_categories=False model.bbox_head.num_classes=91 at the end of your training commands.
For example:
python tools/train.py \
configs/dino/dino_4scale_r50_16x2_12e_coco.py \
--seed 42 \
--deterministic \
--options \
data.train.continuous_categories=False \
model.bbox_head.num_classes=91
Follow-up plan
Additionally, I'm going to check that the model initialization is aligned with the source code.
I'm still waiting in the queue. The computing resources of the lab have been tight recently, so I've decided to try using 2 GPUs with a 4x bigger batch_size for the experiments.
Results of preliminary training experiments of the current version
Results of DINO_4scale_R50_12e
Train with origin repo code
script:
PARTITION_NAME=$1
NUM_GPUS=8
srun -p $PARTITION_NAME -n 1 --ntasks-per-node=1 --gres=gpu:$NUM_GPUS --cpus-per-task=40 \
python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS main.py \
--output_dir logs/DINO/R50-MS4 -c config/DINO/DINO_4scale.py --coco_path data/coco/ \
--options dn_scalar=100 embed_init_tgt=TRUE \
dn_label_coef=1.0 dn_bbox_coef=1.0 use_ema=False \
dn_box_noise_scale=1.0
Results (Official: 49.0):
seed=42:
"test_coco_eval_bbox": [0.4873132674960999, 0.6613892204926395, 0.5299718405240965, 0.3097563780570846, 0.5200259496800922, 0.6276603680117593, 0.3760974947699587, 0.648398246301584, 0.7259767833244375, 0.5585821102614945, 0.7691926278555588, 0.8832943363588871
seed=1999:
"test_coco_eval_bbox": [0.48881166833204975, 0.6625810511292044, 0.5337058970483411, 0.310836480374128, 0.5213143580767997, 0.6386154609695947, 0.375656002127738, 0.648566769851021, 0.7225129787686829, 0.5469380246490395, 0.7644535576931001, 0.8857967425155622]
seed=2022:
"test_coco_eval_bbox": [0.48596151760994727, 0.6591535604313493, 0.5308893680282089, 0.3048096267805175, 0.5166487219765998, 0.6290219521155032, 0.37418452677145564, 0.6453000252716654, 0.7215818952300844, 0.5433005018910609, 0.7590221671325503, 0.8817776135205894]
I'll run a few more experiments to see the approximate range of fluctuation in the results.
Train with committed code
script:
PARTITION_NAME=$1
SCALE=$2
EPOCH=$3
SEED=$4
ARGS=${@:5}
GPUS=8 GPUS_PER_NODE=8 \
./tools/slurm_train.sh \
mm_det \
dino_${SCALE}scale_r50_8x2_${EPOCH}e_coco \
./configs/dino/dino_${SCALE}scale_r50_8x2_${EPOCH}e_coco.py \
/mnt/lustre/liqingyun.vendor/work_dirs/dino_${SCALE}scale_r50_8x2_${EPOCH}e_coco_seed_${SEED} \
--seed ${SEED} ${ARGS}
Results:
The results below adopted FrozenBatchNorm2d and convs with bias in the neck to align the model architecture; please refer to previous comments for the modifications required to run. I'll also conduct experiments with the mmcv native modules.
seed=42 --deterministic
bbox_mAP: 0.4860, bbox_mAP_50: 0.6600, bbox_mAP_75: 0.5310, bbox_mAP_s: 0.3120, bbox_mAP_m: 0.5190, bbox_mAP_l: 0.6310
seed=68
bbox_mAP: 0.4890, bbox_mAP_50: 0.6610, bbox_mAP_75: 0.5340, bbox_mAP_s: 0.3050, bbox_mAP_m: 0.5190, bbox_mAP_l: 0.6410
seed=1920
bbox_mAP: 0.4880, bbox_mAP_50: 0.6620, bbox_mAP_75: 0.5340, bbox_mAP_s: 0.3040, bbox_mAP_m: 0.5230, bbox_mAP_l: 0.6320
seed=1999
bbox_mAP: 0.4880, bbox_mAP_50: 0.6620, bbox_mAP_75: 0.5340, bbox_mAP_s: 0.3040, bbox_mAP_m: 0.5230, bbox_mAP_l: 0.6320
seed=2022
bbox_mAP: 0.4860, bbox_mAP_50: 0.6580, bbox_mAP_75: 0.5300, bbox_mAP_s: 0.3080, bbox_mAP_m: 0.5210, bbox_mAP_l: 0.6310
seed=6666
bbox_mAP: 0.4860, bbox_mAP_50: 0.6590, bbox_mAP_75: 0.5300, bbox_mAP_s: 0.3050, bbox_mAP_m: 0.5150, bbox_mAP_l: 0.6360
seed=6666, --diff-seed
bbox_mAP: 0.4850, bbox_mAP_50: 0.6580, bbox_mAP_75: 0.5290, bbox_mAP_s: 0.3110, bbox_mAP_m: 0.5190, bbox_mAP_l: 0.6280
seed=2022, --diff-seed
bbox_mAP: 0.4840, bbox_mAP_50: 0.6540, bbox_mAP_75: 0.5270, bbox_mAP_s: 0.3090, bbox_mAP_m: 0.5160, bbox_mAP_l: 0.6320
The results below adopted the mmcv native BN2d and neck; see the configs of the latest version.
seed=42
bbox_mAP: 0.4860, bbox_mAP_50: 0.6600, bbox_mAP_75: 0.5320, bbox_mAP_s: 0.3120, bbox_mAP_m: 0.5210, bbox_mAP_l: 0.6330
seed=1920
bbox_mAP: 0.4870, bbox_mAP_50: 0.6590, bbox_mAP_75: 0.5290, bbox_mAP_s: 0.3150, bbox_mAP_m: 0.5180, bbox_mAP_l: 0.6330
seed=1999
bbox_mAP: 0.4870, bbox_mAP_50: 0.6620, bbox_mAP_75: 0.5340, bbox_mAP_s: 0.3040, bbox_mAP_m: 0.5220, bbox_mAP_l: 0.6360
seed=2022
bbox_mAP: 0.4880, bbox_mAP_50: 0.6620, bbox_mAP_75: 0.5310, bbox_mAP_s: 0.3110, bbox_mAP_m: 0.5220, bbox_mAP_l: 0.6340
Other problems
The training of my implementation took about 13 hours, while that of the original repo code took 8 hours. I also cannot reach the reported inference speed (benchmark.py gives 7 img/s, while the reported FPS is 24; note that the FPS scripts are not aligned). Hence I still need to find the reason for the inefficiency of my implementation.
Results of DINO_4scale_R50_36e
Train with origin repo code
command:
PARTITION_NAME=$1
NUM_GPUS=8
srun -p $PARTITION_NAME -n 1 --ntasks-per-node=1 --gres=gpu:$NUM_GPUS --cpus-per-task=40 \
python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS main.py \
--output_dir logs/DINO/R50-MS4-seed_1999 -c config/DINO/DINO_4scale.py --coco_path data/coco/ \
--seed 1999 \
--options dn_scalar=100 embed_init_tgt=TRUE \
dn_label_coef=1.0 dn_bbox_coef=1.0 use_ema=False \
dn_box_noise_scale=1.0 epochs=36 lr_drop=30
Results (Official: 50.9):
seed=1999:
"test_coco_eval_bbox": [0.5053947020639069, 0.685007231553511, 0.5513394037459328, 0.331298595377718, 0.5390263215394429, 0.6505427418959137, 0.3818753372181372, 0.6584023194710107, 0.7304000110183289, 0.56530672690942, 0.773200397094524, 0.8805619748917786], "best_res": 0.5053947020639069]
Train with committed code
Results:
The results below adopted the mmcv native BN2d and neck; see the configs of the latest version.
seed=1920
bbox_mAP: 0.5070, bbox_mAP_50: 0.6870, bbox_mAP_75: 0.5540, bbox_mAP_s: 0.3330, bbox_mAP_m: 0.5370, bbox_mAP_l: 0.6560
seed=2022
bbox_mAP: 0.5050, bbox_mAP_50: 0.6830, bbox_mAP_75: 0.5530, bbox_mAP_s: 0.3390, bbox_mAP_m: 0.5370, bbox_mAP_l: 0.6580
Hello, how did you solve this problem (the "AssertionError: loss log variables are different across GPUs" mentioned above, which happens when dn_cls_scores is None)?
The current code no longer has this problem.
@zhaoguoqing12 Hi, we have basically solved this problem in 390b25a.
Are you running the latest version and seeing this problem again? You can check the version of your local repo; the latest version is currently d380deb. If your local repo is not up to date, you can update it with git pull https://github.com/Li-Qingyun/mmdetection.git add-dino. If the problem appears on the latest version, please provide information such as your environment versions, local modifications, and the command you ran.
I'm indeed not on the latest version, but I did not encounter this problem while training DINO.
@zhaoguoqing12 If the same problem appears in your own repo, you need to guarantee that loss_dict.keys() stays fixed. For example, in the current version of DINO, loss_dict should have 39 items. My way of checking is to run the following before loss_dict is reduced:
if len(loss_dict) != 39:
    import ipdb; ipdb.set_trace()
By moving through the ipdb frames, I found that the dn-related losses were missing from loss_dict.keys(), so I added a branch for the no-target case in the function that computes the dn losses, to guarantee that loss_dict always consists of 39 items in every situation. The current handling solves the problem for now; it may be replaced with a better approach during the later review stage. (For example, when there is no target, loss_giou and loss_bbox are naturally 0; in this case the supplemented dn losses are 0 as well, but without a computation graph.)
I found that the original DINO source code actually also contains handling to prevent similar errors, but the loss part here was almost completely rewritten, so I did not notice it at the time.
So my suggestion is to check how the loss_dict in your error message is obtained. You can use the approach above to locate the failing case, and then fill in the missing entries for that special case.
@Li-Qingyun OK, thanks, I'll try it.
We have started to refactor the modules of DETR-like models to enhance the usability and readability of our codebase. To avoid affecting the progress of experiments in the current version and the work of those who follow this PR, I have created a new PR for the refactor:
#9149
I'll merge the refactor PR into this PR when it's finished. Followers can continue to conduct experiments based on the current PR, and the relevant bugs will continue to be fixed. Experimental results will also continue to be released.
Thank you for your attention!
We have decided to complete the development of DINO on MMDetection 2.x in this PR. The code keeps the old style (it has been refactored into a new style in MMDetection 3.x). Users are recommended to use #9149 in MMDetection 3.x. This support is mainly for users who may have to stay on old versions.
TO-DO List
- [x] refactor CdnQueryGenerator
- [ ] Add DocString
- [ ] detector class
- [ ] Dino Transformer
- [ ] Dino Transformer Decoder
- [ ] DinoHead
- [ ] CdnQueryGenerator
- [ ] Add unit test
- [ ] Add readme
- [ ] Add model to zoo
- [ ] Add metafiles