[Bug] A bug happens during training of semi-supervised object detection
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] I have read the FAQ documentation but cannot get the expected help.
- [X] The bug has not been fixed in the latest version (master) or latest version (3.x).
Task
I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.
Branch
3.x branch https://github.com/open-mmlab/mmdetection/tree/3.x
Environment
sys.platform: linux
Python: 3.8.10 (default, Jun 4 2021, 15:09:15) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3060
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.11.0+cu113
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF
TorchVision: 0.12.0+cu113
OpenCV: 4.6.0
MMEngine: 0.3.2
MMDetection: 3.0.0rc5+92d03df
Reproduces the problem - code sample
I use a config that inherits configs/soft_teacher/soft-teacher_faster-rcnn_r50-caffe_fpn_180k_semi-0.1-coco.py. I changed the keys 'num_classes', 'labeled_dataset.ann_file', 'unlabeled_dataset.ann_file' and 'train_cfg', and added a 'metainfo' key to each dataset dict.
The metainfo is as below:
metainfo = dict(
    classes=('a', 'b', 'c'),
    palette=[(255, 0, 0), (0, 255, 0), (0, 0, 255)])
The train_cfg is as below:
train_cfg = dict(type='IterBasedTrainLoop', max_iters=3000, val_interval=600)
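Putting those changes together, the child config looks roughly like the sketch below. This is a reconstruction rather than the exact file: the `_base_` path, class names and annotation paths are placeholders, and the override keys may need adjusting to the base config's structure.

```python
# Rough sketch of the child config (untested; paths and class names are placeholders).
_base_ = ['../soft_teacher/soft-teacher_faster-rcnn_r50-caffe_fpn_180k_semi-0.1-coco.py']

metainfo = dict(
    classes=('a', 'b', 'c'),
    palette=[(255, 0, 0), (0, 255, 0), (0, 0, 255)])

# num_classes lives on the detector wrapped by SoftTeacher.
model = dict(
    detector=dict(roi_head=dict(bbox_head=dict(num_classes=3))))

# Point both branches at the custom annotations and attach the metainfo.
labeled_dataset = _base_.labeled_dataset
unlabeled_dataset = _base_.unlabeled_dataset
labeled_dataset.ann_file = 'annotations/labeled.json'      # placeholder path
labeled_dataset.metainfo = metainfo
unlabeled_dataset.ann_file = 'annotations/unlabeled.json'  # placeholder path
unlabeled_dataset.metainfo = metainfo

train_dataloader = dict(
    dataset=dict(datasets=[labeled_dataset, unlabeled_dataset]))

train_cfg = dict(type='IterBasedTrainLoop', max_iters=3000, val_interval=600)
```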
Reproduces the problem - command or script
python tools/train.py configs/faster_rcnn/faster_rcnn_semi_detection.py
Reproduces the problem - error message
02/15 13:42:06 - mmengine - INFO - Checkpoints will be saved to /root/mmdetection/work_dir/faster_unlabeled_test.
02/15 13:42:39 - mmengine - INFO - Iter(train) [ 50/3000] lr: 9.9098e-04 eta: 0:31:54 time: 0.6489 data_time: 0.0164 memory: 9140 loss: 3.2496 sup_loss_rpn_cls: 0.7732 sup_loss_rpn_bbox: 0.3367 sup_loss_cls: 0.8544 sup_acc: 93.5547 sup_loss_bbox: 0.0448 unsup_loss_rpn_cls: 0.8679 unsup_loss_rpn_bbox: 0.0009 unsup_loss_cls: 0.3717 unsup_acc: 100.0000 unsup_loss_bbox: 0.0000
02/15 13:43:10 - mmengine - INFO - Iter(train) [ 100/3000] lr: 1.9920e-03 eta: 0:30:40 time: 0.6205 data_time: 0.0169 memory: 9140 loss: 2.0364 sup_loss_rpn_cls: 0.8109 sup_loss_rpn_bbox: 0.3330 sup_loss_cls: 0.4660 sup_acc: 94.9219 sup_loss_bbox: 0.0895 unsup_loss_rpn_cls: 0.2801 unsup_loss_rpn_bbox: 0.0000 unsup_loss_cls: 0.0568 unsup_acc: 100.0000 unsup_loss_bbox: 0.0000
Traceback (most recent call last):
File "tools/train.py", line 130, in
Additional information
The problem happens during training. How can I fix it?
I have experienced this error before. Maybe modifying the batch_size can help you.
How should I modify the batch_size? Thank you.
My batch_size cfg is as below:
batch_size=5,
num_workers=5,
persistent_workers=True,
sampler=dict(
    type='GroupMultiSourceSampler',
    batch_size=5,
    source_ratio=[1, 4]),
I modified batch_size from 5 to 6, and then train.py worked. You can give it a try.
Did you just change the batch_size? I changed my batch_size cfg as below:
train_dataloader = dict(
    batch_size=6,  # batch_size=5,
    num_workers=5,
    persistent_workers=True,
    sampler=dict(
        type='GroupMultiSourceSampler',
        batch_size=5,
        source_ratio=[1, 4]))
but it doesn't work.
Perhaps you should also modify the batch_size in the sampler? Actually I changed the batch_size several times and tried 2, 4, 5, 6, 8, etc., and suddenly it worked lol
Oh, I changed my batch_size cfg as below:
train_dataloader = dict(
    batch_size=4,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(
        type='GroupMultiSourceSampler',
        batch_size=4,
        source_ratio=[1, 2]))
It works, thank you.
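For the record, my understanding (which may be wrong) is that the dataloader batch_size and the sampler batch_size have to stay in sync, and source_ratio controls how each batch is split between the labeled and unlabeled datasets. Something like:

```python
# Illustrative pairing, not an official recommendation. With batch_size=4 and
# source_ratio=[1, 2], each batch is split between the labeled and unlabeled
# datasets roughly 1:2 (the exact rounding is up to the sampler).
train_dataloader = dict(
    batch_size=4,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(
        type='GroupMultiSourceSampler',
        batch_size=4,  # keep equal to the dataloader batch_size
        source_ratio=[1, 2]))
```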
No worries. After training the model, both training and testing work, but when I try to run image_demo.py to test it, it fails with KeyError: 'backbone'. I have published this as a separate issue: [Reimplementation] A bug when I run image_demo.py in semi-supervised object detection #9782. I don't know whether you will encounter this problem after training your model, so I am looking forward to your help this time. lol
I met the same problems.
(1) Why is there a problem when the original batch_size and num_workers are 5? Do you think the losses dictionary should be initialized to zeros, so that accessing losses['loss_cls'] in mmdet/models/roi_heads/bbox_heads/bbox_head.py can never fail? Something like:
# init
losses['loss_cls'] = torch.tensor(0.0).cuda()
losses['acc'] = torch.tensor([0.0]).cuda()
losses['loss_bbox'] = torch.tensor(0.0).cuda()
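A slightly tidier version of that idea might be the untested sketch below (the helper name and default device are just mine, not from mmdet); note it only avoids the KeyError and does not explain why the branch produced no loss in the first place:

```python
import torch


def init_default_losses(device='cpu'):
    """Untested sketch: give the losses dict zero-valued defaults so that a
    later access such as losses['loss_cls'] cannot raise KeyError when a
    branch produced no predictions. Use device='cuda' when running on GPU."""
    return {
        'loss_cls': torch.tensor(0.0, device=device),
        'acc': torch.tensor([0.0], device=device),
        'loss_bbox': torch.tensor(0.0, device=device),
    }


losses = init_default_losses()
# ... the actual loss computation would overwrite these entries ...
print(losses['loss_cls'])  # tensor(0.) instead of a KeyError
```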
(2) I found that if I set unsuitable values for max_iters and val_interval, the same problem appeared, e.g. train_cfg = dict(type='IterBasedTrainLoop', max_iters=54000, val_interval=1200). Does max_iters have to be a multiple of val_interval?
Oh, I also have this problem. I think you can create a new issue to ask the mmdetection team about it. I hope they can fix it.
@southeastboss Have you solved problem #9782? I am still confused by it. @yjcreation Do you have the #9782 problem too? I think we should let the mmdetection team know about these two problems.
I'm training the model, so I haven't used python tools/analysis_tools/analyze_results.py yet. @zys010219
@yjcreation If you find out anything new about the problem, please update it here. I am looking forward to good news from you.
ok! @zys010219
The code is not very robust and may fail in some cases; @Czm369 will take a look at it.
I experienced the same error while training a soft-teacher with a custom dataset. None of what was proposed here solved the problem in my case, but I found the source of the error for me.
Solutions:
- Remove pretrained model in "load_from" argument.
- Lower learning rate (from 0.01 to 0.001 in my case)
Justification: Through some debugging I discovered that the error occurred when "rcnn_cls_loss_by_pseudo_instances()" is called and "rpn_results_list" is empty, which happens when the previous call to "rpn_loss_by_pseudo_instances()" returns an empty list. I did not take the time to dig in more, but I noticed that the variable x (the features from the FPN) had values on the order of 1e-1 or 1e0, and right before the error occurred the values were closer to 1e-8 and then nan. This led me to try lowering the learning rate and changing the initial state of the model to prevent a divergence problem, which seems to have worked in my case. It may very well be that I did not dig far enough to understand the real problem, but this fix seems to work for now.
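In config terms the workaround looks roughly like this (a sketch of what I changed, not an exact diff; the optimizer keys follow the usual MMEngine layout and may differ in your base config):

```python
# Rough sketch of the two changes described above.

# 1) Don't warm-start from a pretrained checkpoint.
load_from = None

# 2) Lower the learning rate, e.g. from 0.01 to 0.001.
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0.0001))
```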
@andrewcaunes The method you mentioned works, thank you. Have you used tools such as image_demo.py after training the model? I tried to use it, but it reported KeyError: 'backbone'. Actually, I cannot solve the problem I published in #9782 and I am still confused about it. Could you help me fix it? Thanks a lot.
Have you successfully reproduced this code?
@andrewcaunes Your answer makes a lot of sense, especially the "lower learning rate" approach. I think the SoftTeacher parameter settings, such as:
semi_train_cfg=dict(
    cls_pseudo_thr=0.4,
    freeze_teacher=False,
    jitter_scale=0.06,
    jitter_times=5,
    min_pseudo_bbox_wh=(0.01, 0.01),
    pseudo_label_initial_score_thr=0.9,
    reg_pseudo_thr=0.01,
    rpn_pseudo_thr=0.3,
    sup_weight=1.0,
    unsup_weight=0.5)
are the key factors behind this error. The settings of these parameters do not seem to be very free; each parameter needs to be well coordinated, otherwise this error appears:
losses['loss_cls'] = losses['loss_cls'] * len(
KeyError: 'loss_cls'
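Following the empty-proposal observation above, one defensive idea (untested, and it may only hide the underlying divergence rather than fix it) would be to fall back to zero losses when no pseudo proposals survive, instead of letting the code index a missing 'loss_cls' entry. The helper below is only an illustration of that pattern, not the actual mmdet implementation:

```python
import torch


def rcnn_cls_loss_or_zero(rpn_results_list, compute_loss_fn, device='cpu'):
    """Illustrative guard: if the teacher produced no pseudo proposals,
    return zero-valued losses instead of raising KeyError downstream.
    `compute_loss_fn` stands in for the real loss computation."""
    if not rpn_results_list or all(len(r) == 0 for r in rpn_results_list):
        zero = torch.tensor(0.0, device=device)
        return {'loss_cls': zero,
                'loss_bbox': zero.clone(),
                'acc': torch.tensor([0.0], device=device)}
    return compute_loss_fn(rpn_results_list)
```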