[Bug] A bug happens during training of semi-supervised object detection
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] I have read the FAQ documentation but cannot get the expected help.
- [X] The bug has not been fixed in the latest version (master) or latest version (3.x).
Task
I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.
Branch
3.x branch https://github.com/open-mmlab/mmdetection/tree/3.x
Environment
sys.platform: linux
Python: 3.8.10 (default, Jun 4 2021, 15:09:15) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3060
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.11.0+cu113
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF
TorchVision: 0.12.0+cu113
OpenCV: 4.6.0
MMEngine: 0.3.2
MMDetection: 3.0.0rc5+92d03df
Reproduces the problem - code sample
I use a config that inherits configs/soft_teacher/soft-teacher_faster-rcnn_r50-caffe_fpn_180k_semi-0.1-coco.py. I changed the keys 'num_classes', 'labeled_dataset.ann_file', 'unlabeled_dataset.ann_file' and 'train_cfg', and added a 'metainfo' key to each dataset dict.
The metainfo is as below:
metainfo = dict(
    classes=('a', 'b', 'c'),
    palette=[(255, 0, 0), (0, 255, 0), (0, 0, 255)])
The train_cfg is as below:
train_cfg = dict(type='IterBasedTrainLoop', max_iters=3000, val_interval=600)
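Putting those changes together, the child config looks roughly like the sketch below. This is a reconstruction rather than the exact file: the `_base_` path, class names and annotation paths are placeholders, and the override keys may need adjusting to the base config's structure.

```python
# Rough sketch of the child config (untested; paths and class names are placeholders).
_base_ = ['../soft_teacher/soft-teacher_faster-rcnn_r50-caffe_fpn_180k_semi-0.1-coco.py']

metainfo = dict(
    classes=('a', 'b', 'c'),
    palette=[(255, 0, 0), (0, 255, 0), (0, 0, 255)])

# num_classes lives on the detector wrapped by SoftTeacher.
model = dict(
    detector=dict(roi_head=dict(bbox_head=dict(num_classes=3))))

# Point both branches at the custom annotations and attach the metainfo.
labeled_dataset = _base_.labeled_dataset
unlabeled_dataset = _base_.unlabeled_dataset
labeled_dataset.ann_file = 'annotations/labeled.json'      # placeholder path
labeled_dataset.metainfo = metainfo
unlabeled_dataset.ann_file = 'annotations/unlabeled.json'  # placeholder path
unlabeled_dataset.metainfo = metainfo

train_dataloader = dict(
    dataset=dict(datasets=[labeled_dataset, unlabeled_dataset]))

train_cfg = dict(type='IterBasedTrainLoop', max_iters=3000, val_interval=600)
```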
Reproduces the problem - command or script
python tools/train.py configs/faster_rcnn/faster_rcnn_semi_detection.py
Reproduces the problem - error message
02/15 13:42:06 - mmengine - INFO - Checkpoints will be saved to /root/mmdetection/work_dir/faster_unlabeled_test.
02/15 13:42:39 - mmengine - INFO - Iter(train) [ 50/3000] lr: 9.9098e-04 eta: 0:31:54 time: 0.6489 data_time: 0.0164 memory: 9140 loss: 3.2496 sup_loss_rpn_cls: 0.7732 sup_loss_rpn_bbox: 0.3367 sup_loss_cls: 0.8544 sup_acc: 93.5547 sup_loss_bbox: 0.0448 unsup_loss_rpn_cls: 0.8679 unsup_loss_rpn_bbox: 0.0009 unsup_loss_cls: 0.3717 unsup_acc: 100.0000 unsup_loss_bbox: 0.0000
02/15 13:43:10 - mmengine - INFO - Iter(train) [ 100/3000] lr: 1.9920e-03 eta: 0:30:40 time: 0.6205 data_time: 0.0169 memory: 9140 loss: 2.0364 sup_loss_rpn_cls: 0.8109 sup_loss_rpn_bbox: 0.3330 sup_loss_cls: 0.4660 sup_acc: 94.9219 sup_loss_bbox: 0.0895 unsup_loss_rpn_cls: 0.2801 unsup_loss_rpn_bbox: 0.0000 unsup_loss_cls: 0.0568 unsup_acc: 100.0000 unsup_loss_bbox: 0.0000
Traceback (most recent call last):
File "tools/train.py", line 130, in
Additional information
The problem happens during training. How can I fix it?
I have experienced this error before. Maybe modifying the batch_size can help you.
How should I modify the batch_size? Thank you.
My batch_size cfg is as below:
batch_size=5,
num_workers=5,
persistent_workers=True,
sampler=dict(
    type='GroupMultiSourceSampler',
    batch_size=5,
    source_ratio=[1, 4]),
I modified batch_size from 5 to 6, and then train.py worked. You can give it a try.
Did you just change the batch_size? I changed my batch_size cfg as below:
train_dataloader = dict(
    batch_size=6,  # batch_size=5,
    num_workers=5,
    persistent_workers=True,
    sampler=dict(
        type='GroupMultiSourceSampler',
        batch_size=5,
        source_ratio=[1, 4]))
but it doesn't work.
Perhaps you should also modify the batch_size in the sampler? Actually I changed the batch_size several times and tried 2, 4, 5, 6, 8, etc., and suddenly it worked lol
Oh, I changed my batch_size cfg as below:
train_dataloader = dict(
    batch_size=4,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(
        type='GroupMultiSourceSampler',
        batch_size=4,
        source_ratio=[1, 2]))
It works, thank you.
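For the record, my understanding (which may be wrong) is that the dataloader batch_size and the sampler batch_size have to stay in sync, and source_ratio controls how each batch is split between the labeled and unlabeled datasets. Something like:

```python
# Illustrative pairing, not an official recommendation. With batch_size=4 and
# source_ratio=[1, 2], each batch is split between the labeled and unlabeled
# datasets roughly 1:2 (the exact rounding is up to the sampler).
train_dataloader = dict(
    batch_size=4,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(
        type='GroupMultiSourceSampler',
        batch_size=4,  # keep equal to the dataloader batch_size
        source_ratio=[1, 2]))
```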
No worries. After training the model, both training and testing work, but when I try to run image_demo.py to test it, it fails with KeyError: 'backbone'. I have published this as a separate issue: [Reimplementation] A bug when I run image_demo.py in semi-supervised object detection #9782. I don't know whether you will encounter this problem after training your model, so I am looking forward to your help this time. lol
I met the same problems.
(1) Why is there a problem when the original batch_size and num_workers are 5? Do you think the losses dictionary should be initialized to zeros, so that accessing losses['loss_cls'] in mmdet/models/roi_heads/bbox_heads/bbox_head.py can never fail? Something like:
# init
losses['loss_cls'] = torch.tensor(0.0).cuda()
losses['acc'] = torch.tensor([0.0]).cuda()
losses['loss_bbox'] = torch.tensor(0.0).cuda()
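A slightly tidier version of that idea might be the untested sketch below (the helper name and default device are just mine, not from mmdet); note it only avoids the KeyError and does not explain why the branch produced no loss in the first place:

```python
import torch


def init_default_losses(device='cpu'):
    """Untested sketch: give the losses dict zero-valued defaults so that a
    later access such as losses['loss_cls'] cannot raise KeyError when a
    branch produced no predictions. Use device='cuda' when running on GPU."""
    return {
        'loss_cls': torch.tensor(0.0, device=device),
        'acc': torch.tensor([0.0], device=device),
        'loss_bbox': torch.tensor(0.0, device=device),
    }


losses = init_default_losses()
# ... the actual loss computation would overwrite these entries ...
print(losses['loss_cls'])  # tensor(0.) instead of a KeyError
```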
(2) I found that if I set unsuitable values for max_iters and val_interval, the same problem appeared, e.g. train_cfg = dict(type='IterBasedTrainLoop', max_iters=54000, val_interval=1200). Does max_iters have to be a multiple of val_interval?
Oh, I also have this problem. I think you can create a new issue to ask the mmdetection team about it. I hope they can fix it.
@southeastboss Have you solved problem #9782? I am still confused by it. @yjcreation Do you have the #9782 problem too? I think we should let the mmdetection team know about these two problems.
I'm training the model, so I haven't used python tools/analysis_tools/analyze_results.py yet. @zys010219
@yjcreation If you find out anything new about the problem, please update it here. I am looking forward to good news from you.
ok! @zys010219
The code is not very robust and may fail in some cases; @Czm369 will take a look at it.
I experienced the same error while training a soft-teacher with a custom dataset. None of what was proposed here solved the problem in my case, but I found the source of the error for me.
Solutions:
- Remove pretrained model in "load_from" argument.
- Lower learning rate (from 0.01 to 0.001 in my case)
Justification: Through some debugging I discovered that the error occurred when "rcnn_cls_loss_by_pseudo_instances()" is called and "rpn_results_list" is empty, which happens when the previous call to "rpn_loss_by_pseudo_instances()" returns an empty list. I did not take the time to dig in more, but I noticed that the variable x (the features from the FPN) had values on the order of 1e-1 or 1e0, and right before the error occurred the values were closer to 1e-8 and then nan. This led me to try lowering the learning rate and changing the initial state of the model to prevent a divergence problem, which seems to have worked in my case. It may very well be that I did not dig far enough to understand the real problem, but this fix seems to work for now.
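In config terms the workaround looks roughly like this (a sketch of what I changed, not an exact diff; the optimizer keys follow the usual MMEngine layout and may differ in your base config):

```python
# Rough sketch of the two changes described above.

# 1) Don't warm-start from a pretrained checkpoint.
load_from = None

# 2) Lower the learning rate, e.g. from 0.01 to 0.001.
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0.0001))
```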
@andrewcaunes The method you mentioned works, thank you. Have you used tools such as image_demo.py after training the model? I tried to use it, but it reported KeyError: 'backbone'. Actually, I cannot solve the problem I published in #9782 and I am still confused about it. Could you help me fix it? Thanks a lot.
Have you successfully reproduced this code?
@andrewcaunes Your answer makes a lot of sense, especially the "lower learning rate" approach. I think the SoftTeacher parameter settings, such as:
semi_train_cfg=dict(
    cls_pseudo_thr=0.4,
    freeze_teacher=False,
    jitter_scale=0.06,
    jitter_times=5,
    min_pseudo_bbox_wh=(0.01, 0.01),
    pseudo_label_initial_score_thr=0.9,
    reg_pseudo_thr=0.01,
    rpn_pseudo_thr=0.3,
    sup_weight=1.0,
    unsup_weight=0.5)
are the key factors behind this error. The settings of these parameters do not seem to be very free; each parameter needs to be well coordinated, otherwise this error appears:
losses['loss_cls'] = losses['loss_cls'] * len(
KeyError: 'loss_cls'
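Following the empty-proposal observation above, one defensive idea (untested, and it may only hide the underlying divergence rather than fix it) would be to fall back to zero losses when no pseudo proposals survive, instead of letting the code index a missing 'loss_cls' entry. The helper below is only an illustration of that pattern, not the actual mmdet implementation:

```python
import torch


def rcnn_cls_loss_or_zero(rpn_results_list, compute_loss_fn, device='cpu'):
    """Illustrative guard: if the teacher produced no pseudo proposals,
    return zero-valued losses instead of raising KeyError downstream.
    `compute_loss_fn` stands in for the real loss computation."""
    if not rpn_results_list or all(len(r) == 0 for r in rpn_results_list):
        zero = torch.tensor(0.0, device=device)
        return {'loss_cls': zero,
                'loss_bbox': zero.clone(),
                'acc': torch.tensor([0.0], device=device)}
    return compute_loss_fn(rpn_results_list)
```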