During training of RTMDet, loss_bbox and loss_mask are always 0
Describe the bug
When training an RTMDet model of any size with MMDetection on a COCO-format dataset, the overall loss and loss_cls decrease as expected, but loss_bbox and loss_mask start at 0 and stay at 0 for the entire run. The resulting model also produces no detections at inference.
Reproduction
The exact training command: tools/dist_train.sh configs/custom/rtmdet-ins-custom-s.py 2 --auto-scale-lr
My config file:
_base_ = '../rtmdet/rtmdet-ins_s_8xb32-300e_coco.py'

dataset_type = 'CocoDataset'
data_root = '../../datasets/MyDataset/'

num_classes = 8
classes = ('Circular', 'Elliptical', 'Triangular', 'Quadrilateral', 'Polygonal', 'Capsule', 'Unique', 'Spheroid')
metainfo = {
    'classes': ('Circular', 'Elliptical', 'Triangular', 'Quadrilateral', 'Polygonal', 'Capsule', 'Unique', 'Spheroid'),
    'palette': [
        (135, 206, 235),
        (255, 192, 203),
        (255, 218, 185),
        (147, 112, 219),
        (60, 179, 113),
        (255, 165, 0),
        (220, 20, 60),
        (255, 255, 0)
    ]
}

train_dataloader = dict(
    batch_size=8,
    num_workers=10,
    dataset=dict(
        data_root=data_root,
        metainfo=metainfo,
        ann_file=data_root + '/annotations/instances_train.json',
        data_prefix=dict(img=data_root + 'train/')
    )
)

find_unused_parameters = True

val_dataloader = dict(
    batch_size=4,
    num_workers=10,
    dataset=dict(
        data_root=data_root,
        metainfo=metainfo,
        ann_file=data_root + '/annotations/instances_val.json',
        data_prefix=dict(img=data_root + 'val/')
    )
)

test_dataloader = val_dataloader

val_evaluator = dict(ann_file=data_root + 'annotations/instances_val.json')
test_evaluator = val_evaluator
A sample of my logs:
11/09 10:01:10 - mmengine - INFO - Epoch(train) [1][ 50/3256] lr: 1.9623e-05 eta: 1 day, 10:48:45 time: 0.3850 data_time: 0.0542 memory: 4411 loss: 0.5551 loss_cls: 0.5551 loss_bbox: 0.0000 loss_mask: 0.0000
11/09 10:01:24 - mmengine - INFO - Epoch(train) [1][ 100/3256] lr: 3.9643e-05 eta: 1 day, 5:15:10 time: 0.2621 data_time: 0.0017 memory: 4411 loss: 0.5109 loss_cls: 0.5109 loss_bbox: 0.0000 loss_mask: 0.0000
11/09 10:01:37 - mmengine - INFO - Epoch(train) [1][ 150/3256] lr: 5.9663e-05 eta: 1 day, 3:24:11 time: 0.2623 data_time: 0.0015 memory: 4411 loss: 0.4392 loss_cls: 0.4392 loss_bbox: 0.0000 loss_mask: 0.0000
11/09 10:01:50 - mmengine - INFO - Epoch(train) [1][ 200/3256] lr: 7.9683e-05 eta: 1 day, 2:35:58 time: 0.2678 data_time: 0.0014 memory: 4411 loss: 0.3513 loss_cls: 0.3513 loss_bbox: 0.0000 loss_mask: 0.0000
The only modifications I made to the base configs were to increase the maximum number of detections to 500 (I am doing small-object detection, so this is needed for my use case) and to change the checkpoint interval to 5 so that I could evaluate progress in finer steps. I have not modified the mmdetection codebase itself.
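For reference, those two overrides can be expressed in the child config roughly like this (a minimal sketch using the usual MMDetection 3.x keys, not copied verbatim from my file; MMEngine merges these partial dicts into the base config):

model = dict(test_cfg=dict(max_per_img=500))       # raised from the base value for dense small objects
default_hooks = dict(checkpoint=dict(interval=5))  # save a checkpoint every 5 epochs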
I am using a synthetically generated custom instance segmentation dataset in COCO format. Due to the nature of my task I cannot share the dataset in full, but the directory structure is as follows:
> Dataset
| > annotations
| | instances_train.json
| | instances_val.json
| | instances_test.json
| > train
| | trainimage0.png
| | trainimage1.png
| | trainimage2.png
| | ...
| > val
| | valimage0.png
| | valimage1.png
| | valimage2.png
| | ...
| > test
| | testimage0.png
| | testimage1.png
| | testimage2.png
| | ...
And here is a sample of my images and annotations:
"images": [
{
"id": 0,
"file_name": "img_0.png",
"height": 1800,
"width": 1800
},
{
"id": 1,
"file_name": "img_1.png",
"height": 1800,
"width": 1800
},
],
"annotations":[
{
"id": 13384448,
"image_id": 74402,
"category_id": 0,
"segmentation": {
"size": [
1800,
1800
],
"counts": "WhW74mg1>E7J5K4M4L3N2M3M2O2N1O1N3O0O1O1O1O1O2O0O100O10000O2O0000000000001O000001O000O10000O10001N1O100O1O1O100O2N1N2O2N1N3N2M3M3M4L4K5J8GSZPh2"
},
"bbox": [
131.0,
1480.0,
66.0,
66.0
],
"area": 3460,
"iscrowd": 0
},
{
"id": 13384449,
"image_id": 74402,
"category_id": 0,
"segmentation": {
"size": [
1800,
1800
],
"counts": "Rl]?:kg16K4M3L3M3M4L3M3M2N3M3M3N1O2N2N1O2N100O2O0O2O0O10001O0O100000000000000000000001O000O101O0O101N100O2O0O2N2N1O2M3N1N3L4M3M3M3M4L3M4L4L6H\\ef_2"
},
"bbox": [
280.0,
1696.0,
68.0,
66.0
],
"area": 3403,
"iscrowd": 0
}
],
I have written a script to visualize my dataset to confirm that my masks and bounding boxes align with their respective instances as expected, so the annotations are definitely accurate.
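The check was along these lines (a rough sketch using pycocotools and matplotlib, not my exact script; paths are placeholders):

import matplotlib.pyplot as plt
from pycocotools.coco import COCO

coco = COCO('annotations/instances_train.json')            # placeholder path
img_info = coco.loadImgs(coco.getImgIds()[0])[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_info['id']))

plt.imshow(plt.imread('train/' + img_info['file_name']))   # placeholder path
for ann in anns:
    x, y, w, h = ann['bbox']
    plt.gca().add_patch(plt.Rectangle((x, y), w, h, fill=False, edgecolor='red'))
    plt.imshow(coco.annToMask(ann), alpha=0.3)              # decodes both RLE and polygon masks
plt.show()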
Environment
sys.platform: linux
Python: 3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA RTX A5500
CUDA_HOME: /usr/local/cuda-11.7
NVCC: Cuda compilation tools, release 11.7, V11.7.64
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.0.1
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.7
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.5
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.15.2
OpenCV: 4.7.0
MMEngine: 0.7.3
MMDetection: 3.2.0+fe3f809
Additional Environment Info:
- Environment is running inside of WSL2 with CUDA access enabled.
- The installation instructions from the mmdetection documentation were followed exactly. PyTorch was installed using the official PyTorch installation instructions for conda and WSL.
Just an update, I've tried running everything in the official Docker container and the same issue occurs.
Hello, have you solved this problem? @h-fernand
Hi @h-fernand, I'm dealing with exactly the same problem; have you solved it?
I also have the same problem, have you solved it?
I believe I had an issue with the area of my instances not being set correctly, and I was also trying to use RLE-encoded masks. For some reason I was never able to get RLE-encoded masks working properly; when I switched my annotations to polygons, everything worked fine. I think something might be broken in the LoadAnnotations transform for RLE annotations, but I'm not sure what.
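For anyone hitting the same thing, the RLE-to-polygon conversion I'm describing can be done with pycocotools and OpenCV along these lines (a rough sketch, not my exact code):

import cv2
from pycocotools import mask as mask_utils

def rle_to_polygons(ann):
    """Replace a compressed-RLE 'segmentation' with COCO polygon lists, in place."""
    rle = dict(ann['segmentation'])
    if isinstance(rle['counts'], str):
        rle['counts'] = rle['counts'].encode()   # pycocotools expects bytes for compressed RLE
    binary = mask_utils.decode(rle)              # H x W uint8 mask
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    ann['segmentation'] = [c.flatten().tolist() for c in contours if len(c) >= 3]
    ann['area'] = int(binary.sum())              # keep 'area' consistent with the mask
    return ann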
Thanks for your reply, it really works. But I ran into a problem during validation; did you also have that problem?
OK, thanks. I solved the problem; setting the right learning rate is important. I also used a self-defined train_pipeline with poly2mask=False, and that worked as well.
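For reference, the kind of pipeline override meant here looks roughly like this (a sketch only; the heavy RTMDet augmentations from the base config are left out for brevity, and the key part is poly2mask=False on LoadAnnotations):

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True,
         poly2mask=False),  # keep polygon masks instead of converting them to bitmaps
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    dict(type='Pad', size=(640, 640), pad_val=dict(img=(114, 114, 114))),
    dict(type='PackDetInputs')
]
train_dataloader = dict(dataset=dict(pipeline=train_pipeline))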