During training of RTMDet, loss_bbox and loss_mask are always 0
Describe the bug
When training an RTMDet model of any size with MMDetection on a COCO-format dataset, the overall loss and loss_cls decrease as expected, but loss_bbox and loss_mask start at 0 and stay at 0 for the entire run. The resulting model also produces no detections at inference.
Reproduction
The exact training command: tools/dist_train.sh configs/custom/rtmdet-ins-custom-s.py 2 --auto-scale-lr
My config file:
_base_ = '../rtmdet/rtmdet-ins_s_8xb32-300e_coco.py'

dataset_type = 'CocoDataset'
data_root = '../../datasets/MyDataset/'

num_classes = 8
classes = ('Circular', 'Elliptical', 'Triangular', 'Quadrilateral', 'Polygonal', 'Capsule', 'Unique', 'Spheroid')
metainfo = {
    'classes': ('Circular', 'Elliptical', 'Triangular', 'Quadrilateral', 'Polygonal', 'Capsule', 'Unique', 'Spheroid'),
    'palette': [
        (135, 206, 235),
        (255, 192, 203),
        (255, 218, 185),
        (147, 112, 219),
        (60, 179, 113),
        (255, 165, 0),
        (220, 20, 60),
        (255, 255, 0)
    ]
}

train_dataloader = dict(
    batch_size=8,
    num_workers=10,
    dataset=dict(
        data_root=data_root,
        metainfo=metainfo,
        ann_file=data_root + '/annotations/instances_train.json',
        data_prefix=dict(img=data_root + 'train/')
    )
)

find_unused_parameters = True

val_dataloader = dict(
    batch_size=4,
    num_workers=10,
    dataset=dict(
        data_root=data_root,
        metainfo=metainfo,
        ann_file=data_root + '/annotations/instances_val.json',
        data_prefix=dict(img=data_root + 'val/')
    )
)

test_dataloader = val_dataloader

val_evaluator = dict(ann_file=data_root + 'annotations/instances_val.json')
test_evaluator = val_evaluator
A sample of my logs:
11/09 10:01:10 - mmengine - INFO - Epoch(train) [1][ 50/3256] lr: 1.9623e-05 eta: 1 day, 10:48:45 time: 0.3850 data_time: 0.0542 memory: 4411 loss: 0.5551 loss_cls: 0.5551 loss_bbox: 0.0000 loss_mask: 0.0000
11/09 10:01:24 - mmengine - INFO - Epoch(train) [1][ 100/3256] lr: 3.9643e-05 eta: 1 day, 5:15:10 time: 0.2621 data_time: 0.0017 memory: 4411 loss: 0.5109 loss_cls: 0.5109 loss_bbox: 0.0000 loss_mask: 0.0000
11/09 10:01:37 - mmengine - INFO - Epoch(train) [1][ 150/3256] lr: 5.9663e-05 eta: 1 day, 3:24:11 time: 0.2623 data_time: 0.0015 memory: 4411 loss: 0.4392 loss_cls: 0.4392 loss_bbox: 0.0000 loss_mask: 0.0000
11/09 10:01:50 - mmengine - INFO - Epoch(train) [1][ 200/3256] lr: 7.9683e-05 eta: 1 day, 2:35:58 time: 0.2678 data_time: 0.0014 memory: 4411 loss: 0.3513 loss_cls: 0.3513 loss_bbox: 0.0000 loss_mask: 0.0000
The only modifications I made to the base configs were to increase the maximum number of detections to 500 (I am doing small-object detection, so this is needed for my use case) and to change the checkpoint interval to 5 so that I could evaluate progress in finer steps. I have not modified the mmdetection codebase itself.
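For reference, those two overrides can be expressed in the child config roughly like this (a minimal sketch using the usual MMDetection 3.x keys, not copied verbatim from my file; MMEngine merges these partial dicts into the base config):

model = dict(test_cfg=dict(max_per_img=500))       # raised from the base value for dense small objects
default_hooks = dict(checkpoint=dict(interval=5))  # save a checkpoint every 5 epochs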
I am using a synthetically generated custom instance segmentation dataset in COCO format. Due to the nature of my task I cannot share the dataset in full, but the directory structure is as follows:
> Dataset
| > annotations
| | instances_train.json
| | instances_val.json
| | instances_test.json
| > train
| | trainimage0.png
| | trainimage1.png
| | trainimage2.png
| | ...
| > val
| | valimage0.png
| | valimage1.png
| | valimage2.png
| | ...
| > test
| | testimage0.png
| | testimage1.png
| | testimage2.png
| | ...
And here is a sample of my images and annotations:
"images": [
{
"id": 0,
"file_name": "img_0.png",
"height": 1800,
"width": 1800
},
{
"id": 1,
"file_name": "img_1.png",
"height": 1800,
"width": 1800
},
],
"annotations":[
{
"id": 13384448,
"image_id": 74402,
"category_id": 0,
"segmentation": {
"size": [
1800,
1800
],
"counts": "WhW74mg1>E7J5K4M4L3N2M3M2O2N1O1N3O0O1O1O1O1O2O0O100O10000O2O0000000000001O000001O000O10000O10001N1O100O1O1O100O2N1N2O2N1N3N2M3M3M4L4K5J8GSZPh2"
},
"bbox": [
131.0,
1480.0,
66.0,
66.0
],
"area": 3460,
"iscrowd": 0
},
{
"id": 13384449,
"image_id": 74402,
"category_id": 0,
"segmentation": {
"size": [
1800,
1800
],
"counts": "Rl]?:kg16K4M3L3M3M4L3M3M2N3M3M3N1O2N2N1O2N100O2O0O2O0O10001O0O100000000000000000000001O000O101O0O101N100O2O0O2N2N1O2M3N1N3L4M3M3M3M4L3M4L4L6H\\ef_2"
},
"bbox": [
280.0,
1696.0,
68.0,
66.0
],
"area": 3403,
"iscrowd": 0
}
],
I have written a script to visualize my dataset to confirm that my masks and bounding boxes align with their respective instances as expected, so the annotations are definitely accurate.
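The check was along these lines (a rough sketch using pycocotools and matplotlib, not my exact script; paths are placeholders):

import matplotlib.pyplot as plt
from pycocotools.coco import COCO

coco = COCO('annotations/instances_train.json')            # placeholder path
img_info = coco.loadImgs(coco.getImgIds()[0])[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_info['id']))

plt.imshow(plt.imread('train/' + img_info['file_name']))   # placeholder path
for ann in anns:
    x, y, w, h = ann['bbox']
    plt.gca().add_patch(plt.Rectangle((x, y), w, h, fill=False, edgecolor='red'))
    plt.imshow(coco.annToMask(ann), alpha=0.3)              # decodes both RLE and polygon masks
plt.show()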
Environment
sys.platform: linux
Python: 3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA RTX A5500
CUDA_HOME: /usr/local/cuda-11.7
NVCC: Cuda compilation tools, release 11.7, V11.7.64
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.0.1
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.7
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.5
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.15.2
OpenCV: 4.7.0
MMEngine: 0.7.3
MMDetection: 3.2.0+fe3f809
Additional Environment Info:
- Environment is running inside of WSL2 with CUDA access enabled.
- The installation instructions from the mmdetection documentation were followed exactly. PyTorch was installed using the official PyTorch installation instructions for conda and WSL.
Just an update, I've tried running everything in the official Docker container and the same issue occurs.
Hello, have you solved this problem? @h-fernand
Hi @h-fernand, I'm dealing with exactly the same problem; have you solved it?
I also have the same problem, have you solved it?
I believe I had an issue with the area of my instances not being set correctly, and I was also trying to use RLE-encoded masks. For some reason I was never able to get RLE-encoded masks working properly; when I switched my annotations to polygons, everything worked fine. I think something might be broken in the LoadAnnotations transform for RLE annotations, but I'm not sure what.
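For anyone hitting the same thing, the RLE-to-polygon conversion I'm describing can be done with pycocotools and OpenCV along these lines (a rough sketch, not my exact code):

import cv2
from pycocotools import mask as mask_utils

def rle_to_polygons(ann):
    """Replace a compressed-RLE 'segmentation' with COCO polygon lists, in place."""
    rle = dict(ann['segmentation'])
    if isinstance(rle['counts'], str):
        rle['counts'] = rle['counts'].encode()   # pycocotools expects bytes for compressed RLE
    binary = mask_utils.decode(rle)              # H x W uint8 mask
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    ann['segmentation'] = [c.flatten().tolist() for c in contours if len(c) >= 3]
    ann['area'] = int(binary.sum())              # keep 'area' consistent with the mask
    return ann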
Thanks for your reply, it really works. But I ran into a problem during validation; did you also have that problem?
OK, thanks. I solved the problem; setting the right learning rate is important. I also used a self-defined train_pipeline with poly2mask=False, and that worked as well.
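For reference, the kind of pipeline override meant here looks roughly like this (a sketch only; the heavy RTMDet augmentations from the base config are left out for brevity, and the key part is poly2mask=False on LoadAnnotations):

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True,
         poly2mask=False),  # keep polygon masks instead of converting them to bitmaps
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    dict(type='Pad', size=(640, 640), pad_val=dict(img=(114, 114, 114))),
    dict(type='PackDetInputs')
]
train_dataloader = dict(dataset=dict(pipeline=train_pipeline))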