RT-DETR
Improve bounding box class performance
RT-DETR v2 Training Issues with Custom Dataset
I'm currently training RT-DETR v2 (PyTorch implementation) on a custom dataset. While the model performs well at detecting bounding boxes and their coordinates, it's showing suboptimal performance in class identification.
Questions
1. Class Performance Emphasis
Is there a way to adjust the training process to put more emphasis on classification performance?
2. Separate Classification Model
I noticed there's a dedicated classification task in the codebase:
TASKS: Dict[str, BaseSolver] = {
    'classification': ClasSolver,
    'detection': DetSolver,
}
Would training a separate classification model improve the overall performance?
3. Performance Improvement
What are some recommended approaches to improve the model's class identification accuracy?
- You can modify the loss_vfl weight (see the config sketch below):
  https://github.com/lyuwenyu/RT-DETR/blob/0b6972de10bc968045aba776ec1a60efea476165/rtdetrv2_pytorch/configs/rtdetrv2/include/rtdetrv2_r50vd.yml#L73
- No, they are for two separate tasks.
- See answer 1.
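For reference, the criterion section around that line looks roughly like the sketch below (the exact section name and default values are in the linked file). Increasing loss_vfl relative to the two box losses shifts the training emphasis toward classification:

# rtdetrv2_r50vd.yml -- criterion section (a sketch; check the linked line for the exact defaults)
RTDETRCriterionv2:
  weight_dict: {loss_vfl: 2, loss_bbox: 5, loss_giou: 2}   # e.g. raise loss_vfl relative to the box losses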
Thanks for your reply.
I'm thinking of using DINOv2 as the backbone. Would that be easy to do? If so, which files would I need to modify?
Thanks again
Yes, you just need to register the new backbone using @register().
see details https://github.com/lyuwenyu/RT-DETR/blob/main/rtdetrv2_pytorch/src/nn/backbone/hgnetv2.py#L272
Then replace the old one with your registered module name in the config:
https://github.com/lyuwenyu/RT-DETR/blob/main/rtdetrv2_pytorch/configs/rtdetrv2/include/rtdetrv2_r50vd.yml#L13
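A minimal sketch of what the registration could look like, modeled on the pattern in hgnetv2.py. The file name, class name, and layer sizes below are made up for illustration; only the @register() decorator and the convention that forward() returns a list of multi-scale feature maps come from the existing backbones.

# Hypothetical new file, e.g. src/nn/backbone/my_backbone.py -- a sketch, not part of the repo
import torch.nn as nn

from ...core import register   # assumption: mirror the import used in hgnetv2.py


@register()
class MyBackbone(nn.Module):
    # Judging from the existing backbones, the contract is: forward(x) returns a
    # list of (B, C, H, W) feature maps, one per scale consumed by HybridEncoder.
    def __init__(self, out_channels=(128, 256, 512)):
        super().__init__()
        self.stage3 = nn.Conv2d(3, out_channels[0], 3, stride=8, padding=1)                 # stride 8
        self.stage4 = nn.Conv2d(out_channels[0], out_channels[1], 3, stride=2, padding=1)   # stride 16
        self.stage5 = nn.Conv2d(out_channels[1], out_channels[2], 3, stride=2, padding=1)   # stride 32

    def forward(self, x):
        c3 = self.stage3(x)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        return [c3, c4, c5]

Then set backbone: MyBackbone (the registered class name) in the config linked above, and make sure HybridEncoder's in_channels matches the channel counts the backbone actually outputs.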
Great, thanks, I will try that. In the meantime I tried to use the pre-defined TimmModel and HGNetv2 backbones, without success.
Implementation Issues with TimmModel and HGNetv2 Backbones
HGNetv2 Implementation
Configuration
# rtdetrv2_r50vd.yml
RTDETR:
  backbone: HGNetv2
  encoder: HybridEncoder
  decoder: RTDETRTransformerv2

# rtdetrv2_r18vd_120e_coco.yml
HGNetv2:
  name: L
Error
RuntimeError: Given groups=1, weight of size [256, 128, 1, 1], expected input[16, 512, 80, 80]
to have 128 channels, but got 512 channels instead
Error location: hybrid_encoder.py, line 294
TimmModel Implementation
Configuration
# rtdetrv2_r50vd.yml
RTDETR:
  backbone: TimmModel
  encoder: HybridEncoder
  decoder: RTDETRTransformerv2

# rtdetrv2_r18vd_120e_coco.yml
TimmModel:
  name: resnet34
  return_layers: ['layer2', 'layer4']
Error
AssertionError: assert len(feats) == len(self.in_channels)
Error location: hybrid_encoder.py, line 293
The assertion error suggests a mismatch between the number of feature maps returned by the backbone (two, given the return_layers above) and the number of entries in the encoder's in_channels.
Could you help me resolve these issues, particularly with the TimmModel configuration?
And this line should be adapted to the specific backbone:
https://github.com/lyuwenyu/RT-DETR/blob/0b6972de10bc968045aba776ec1a60efea476165/rtdetrv2_pytorch/configs/rtdetrv2/include/rtdetrv2_r50vd.yml#L29
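For example, assuming resnet34 through TimmModel, the backbone should return three scales and the encoder's in_channels should list the matching channel counts. A rough sketch (the 128/256/512 values are what timm's resnet34 produces at layer2/layer3/layer4; adjust for whichever backbone you use):

# rtdetrv2_r18vd_120e_coco.yml -- sketch only, not a verified config
TimmModel:
  name: resnet34
  return_layers: ['layer2', 'layer3', 'layer4']   # three scales at strides 8/16/32

HybridEncoder:
  in_channels: [128, 256, 512]   # channel counts of the returned layers, in order
  feat_strides: [8, 16, 32]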
ViT and HybridEncoder Compatibility Analysis
Thanks, it finally worked. I then tried to use a Vision Transformer (ViT) as the backbone through TimmModel, but its output does not seem to be compatible with what HybridEncoder expects.
Here is the summary of what I understood:
HybridEncoder Expectations
- It expects inputs in CNN format (batch_size, channels, height, width)
- Default in_channels=[512, 1024, 2048] (typical ResNet feature map channels)
- Input features should have decreasing spatial dimensions with feat_strides=[8, 16, 32]
ViT Last 3 Layers Output
- Shape: (batch_size, N_patches, 768)
- No explicit spatial structure
- Constant channel dimension (768)
- All layers have same dimensions
Mismatch Issues
1. Dimensional Structure
- HybridEncoder expects 4D tensors (B,C,H,W)
- ViT outputs 3D tensors (B,N,D)
2. Channel Progression
- HybridEncoder expects increasing channels (512->1024->2048)
- ViT has constant channels (768)
3. Spatial Resolution
- HybridEncoder expects decreasing spatial dimensions
- ViT maintains constant number of patches
I'm trying to adapt the ViT outputs, but I think the adaptation might not be optimal because:
1. ViT's strength lies in global attention
2. Forcing a spatial structure might lose the global relationship information
3. ResNet's feature hierarchy is fundamentally different from ViT's feature representation
Can you please confirm this? And is there a way to make them compatible?
Thanks a lot!
Yes, I think you are right.
One possible solution is to add an extra adaptation module. You can reference this paper.
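Not the module from that paper, but as a rough illustration of the idea, a minimal (hypothetical) adapter could reshape the ViT patch tokens back into a spatial map and then produce a small three-level pyramid at strides 8/16/32, so HybridEncoder receives (B, C, H, W) inputs:

# Minimal sketch with made-up layer choices -- not the repo's API
import torch
import torch.nn as nn


class SimpleViTAdapter(nn.Module):
    def __init__(self, embed_dim=768, patch_size=16, out_channels=(128, 256, 512)):
        super().__init__()
        self.patch_size = patch_size
        self.up = nn.ConvTranspose2d(embed_dim, out_channels[0], 2, stride=2)      # stride 16 -> 8
        self.mid = nn.Conv2d(embed_dim, out_channels[1], 1)                        # stride 16 -> 16
        self.down = nn.Conv2d(embed_dim, out_channels[2], 3, stride=2, padding=1)  # stride 16 -> 32

    def forward(self, tokens, img_hw):
        # tokens: (B, N, D) patch tokens, with any class token already removed
        b, n, d = tokens.shape
        h, w = img_hw[0] // self.patch_size, img_hw[1] // self.patch_size
        feat = tokens.transpose(1, 2).reshape(b, d, h, w)   # (B, D, H/16, W/16)
        return [self.up(feat), self.mid(feat), self.down(feat)]


# quick shape check with hypothetical sizes (640x640 input, ViT-B/16 tokens)
adapter = SimpleViTAdapter()
tokens = torch.randn(2, (640 // 16) ** 2, 768)
for f in adapter(tokens, (640, 640)):
    print(f.shape)   # [2, 128, 80, 80], [2, 256, 40, 40], [2, 512, 20, 20]

This keeps the ViT itself untouched and only bolts a few convolutions on top; the loss of some global-attention benefit you describe is the trade-off, which is why the adaptation modules proposed in the literature tend to interact with the ViT blocks more deeply.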
Ok thanks very much. I will check it.