Poor Object Detection Performance with DINOv2 Backbone and Faster R-CNN Head on Cityscapes Dataset

Open busenuraktilav opened this issue 5 months ago • 10 comments

I am working on an object detection task using the DINOv2 backbone with a Faster R-CNN head. While I have successfully implemented semantic segmentation with a linear head on the Cityscapes dataset and replicated the results from the relevant paper, I am encountering significant challenges in applying the DINOv2 backbone for object detection.

I used the dinov2_vits14_pretrain model and added a Faster R-CNN head as follows:

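from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Dinov2Backbone and IdentityTransform are custom classes not shown in this post.

def create_model(num_classes):
    backbone = Dinov2Backbone()
    backbone.out_channels = 384  # FasterRCNN reads this attribute; 384 is the ViT-S embedding dimension

    downsampling_factor = 16
    feature_map_size = 630 // downsampling_factor  # 39

    # One anchor size and three aspect ratios for the single feature map
    anchor_size = (feature_map_size,)  # Single size tuple
    anchor_generator = AnchorGenerator(sizes=(anchor_size,), aspect_ratios=((0.5, 1.0, 2.0),))
    # '0' is the key torchvision assigns when the backbone returns a single tensor
    roi_pooler = MultiScaleRoIAlign(featmap_names=['0'], output_size=7, sampling_ratio=2)

    model = FasterRCNN(backbone, min_size=630, num_classes=num_classes,
                       rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler)
    model.transform = IdentityTransform()  # replaces torchvision's built-in resize/normalize step
    return model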
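
For reference, the arithmetic in the snippet above can be checked without a GPU. This is a minimal sketch, not part of the training code: `token_grid` and `base_anchors` are hypothetical helper names, and the wider size list is only an illustrative assumption. It relies on two points: the "14" in dinov2_vits14 is the patch size, so a 630x630 input yields a 45x45 token grid rather than the 39 obtained from `630 // 16`, and torchvision's `AnchorGenerator` interprets `sizes` as anchor side lengths in input-image pixels, with each aspect ratio taken as height over width.

```python
def token_grid(image_size: int, patch_size: int) -> int:
    """Number of patch tokens along one side of a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    return image_size // patch_size


def base_anchors(sizes, aspect_ratios):
    """(width, height) in pixels for every size/ratio pair.

    Mirrors the convention ratio = height / width with area = size ** 2.
    """
    anchors = []
    for ratio in aspect_ratios:
        for size in sizes:
            height = size * ratio ** 0.5
            width = size / ratio ** 0.5
            anchors.append((round(width, 1), round(height, 1)))
    return anchors


print(token_grid(630, 14))  # 45 tokens per side, i.e. an effective stride of 14 pixels
print(630 // 16)            # 39, the value used as the only anchor size above

# The configuration from the snippet: three anchors, all with an area of 39 * 39 pixels
print(base_anchors((39,), (0.5, 1.0, 2.0)))

# An assumed wider range of sizes, for comparison only
print(base_anchors((32, 64, 128, 256, 512), (0.5, 1.0, 2.0)))
```

With a single 39-pixel size, every anchor covers roughly the same small area, so larger objects such as buses or trucks may have no anchor with enough overlap to serve as a positive sample during RPN training. Whether this explains the missing classes here is not confirmed.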

Dataset and Training: I used the Cityscapes dataset, which includes classes such as 'person', 'rider', 'car', 'truck', 'bus', 'train', 'motorcycle', and 'bicycle'. I preprocessed the images to 630x630 while preserving the aspect ratio (resizing, then padding). Following the training procedure outlined in the Faster R-CNN tutorial, I trained the model for 15 epochs.
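
The resize-and-pad step also has to be applied to the ground-truth boxes, otherwise the targets no longer line up with the image. The following is a minimal sketch of that bookkeeping, assuming centered padding and boxes in (x1, y1, x2, y2) pixel format; `letterbox_params` and `transform_box` are hypothetical names, and the actual preprocessing used for this issue is not shown.

```python
def letterbox_params(width: int, height: int, target: int = 630):
    """Scale factor and padding offsets that fit an image into a target square."""
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x = (target - new_w) // 2  # assumption: padding is split evenly on both sides
    pad_y = (target - new_h) // 2
    return scale, pad_x, pad_y


def transform_box(box, scale, pad_x, pad_y):
    """Map an (x1, y1, x2, y2) box from the original image into the padded square."""
    x1, y1, x2, y2 = box
    return (x1 * scale + pad_x, y1 * scale + pad_y,
            x2 * scale + pad_x, y2 * scale + pad_y)


# Cityscapes frames are 2048x1024 pixels
scale, pad_x, pad_y = letterbox_params(2048, 1024)
print(scale, pad_x, pad_y)

# The full-frame box should land on the unpadded region of the 630x630 canvas
print(transform_box((0, 0, 2048, 1024), scale, pad_x, pad_y))
```

Under these assumptions a 2048x1024 frame shrinks by roughly 3.25x, so a pedestrian that is 40 pixels tall in the original ends up about 12 pixels tall, which is smaller than a single 14-pixel patch.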

Issue Encountered: The model's performance is disappointing. It only predicts the 'car', 'person', and 'rider' classes, and the accuracy of these predictions is poor. It places bounding boxes on unrelated parts of the image and does not predict the other classes at all (AP scores are 0 for the other classes, while for these three classes the AP scores are 0.99). These results fall short of what I expected, given the model's performance on semantic segmentation.

Questions and Assistance Request:

  1. Is there any existing documentation or examples of using DINOv2 for object detection tasks?
  2. Is the DINOv2 backbone suitable for object detection tasks, or is it primarily designed for other purposes like semantic segmentation?
  3. Any suggestions for modifications or alternative approaches to improve object detection results with DINOv2 on the Cityscapes dataset would be greatly appreciated.

busenuraktilav, Jan 04 '24 17:01