dinov3 Compatibility of Lighter Backbones with ADE20k Segmentation Head (Embedding Size Mismatch)

Hi team,

First off, amazing work on DINOv3 — the results and design are truly impressive!

I wanted to highlight a potential limitation when using lighter backbones with the segmentation head trained on the ADE20k dataset. The segmentation head expects fixed-size embeddings of 4096 dimensions, while the distilled backbones (e.g., smaller variants) output embeddings of lower dimensionality (around 2048). Because of this mismatch, these lighter models can’t be used directly with the ADE20k-trained segmentation head.

Since the performance gap in segmentation between heavier and lighter backbones isn’t substantial, it would be fantastic if segmentation head weights trained with lighter backbones could be made available. If that’s not feasible, sharing the training scripts for the segmentation task would be incredibly helpful for reproducing and extending your work.

Any guidance or insights on this would be greatly appreciated!

Best, Anit

Oct 08 '25 12:10 PixxelAnit

Hi, thank you for your interest in DINOv3! Yes, because the backbones have different embedding sizes, each one needs its own head. Code is available for training a linear segmentation head, which gives exciting results (55.9 mIoU on a ViT-7B) and can also be run with smaller backbones. Please refer to the dinov3/eval/segmentation/ folder and the instructions in the README.md, which will guide you through running the code.
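To illustrate why a head is tied to one backbone width, here is a minimal PyTorch sketch of a per-patch linear segmentation head. It is not the code in dinov3/eval/segmentation/. The class name `LinearSegHead`, the helper `fake_patch_tokens`, the 14x14 patch grid, and the 1024-dim "smaller backbone" width are assumptions made for illustration. The 4096 width and the ADE20K setting come from this thread, and 150 is the usual ADE20K class count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 150  # number of semantic classes in the standard ADE20K benchmark


class LinearSegHead(nn.Module):
    """Hypothetical per-patch linear classifier over backbone patch tokens."""

    def __init__(self, embed_dim: int, num_classes: int = NUM_CLASSES):
        super().__init__()
        # The weight matrix has shape [num_classes, embed_dim], so a head
        # built for one embedding size cannot consume another.
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens, grid_hw, out_hw):
        # patch_tokens: [batch, h * w, embed_dim]
        b = patch_tokens.shape[0]
        h, w = grid_hw
        logits = self.classifier(patch_tokens)                # [b, h*w, classes]
        logits = logits.transpose(1, 2).reshape(b, -1, h, w)  # [b, classes, h, w]
        # Upsample the patch-level logits to the requested output resolution.
        return F.interpolate(logits, size=out_hw, mode="bilinear", align_corners=False)


def fake_patch_tokens(embed_dim, batch=2, grid_hw=(14, 14)):
    # Stand-in for real backbone output; random values, illustration only.
    return torch.randn(batch, grid_hw[0] * grid_hw[1], embed_dim)


head_large = LinearSegHead(embed_dim=4096)  # width the released head expects
head_small = LinearSegHead(embed_dim=1024)  # assumed width of a smaller backbone

tokens_small = fake_patch_tokens(1024)
out = head_small(tokens_small, (14, 14), (224, 224))
print(out.shape)

# Feeding the smaller features to the 4096-dim head fails with a shape error.
try:
    head_large(tokens_small, (14, 14), (224, 224))
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True
print("mismatch raised:", mismatch_raised)
```

Since only `embed_dim` changes between the two heads, training a head for a smaller backbone comes down to constructing it with that backbone's embedding size.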

Oct 09 '25 14:10 se-yi

Thank you very much for your helpful response! I’d like to ask a few follow-up questions about some details mentioned in the paper, and I’d really appreciate your clarification:

In Chapter 7, are all the semantic segmentation results obtained using the linear segmentation head rather than the M2F head?

For all the tasks in Chapter 7, is the backbone frozen during training?

Oct 14 '25 03:10 Zoulinx

Hi, of course! In Section 7, the segmentation results in Table 14 were obtained with the linear segmentation head (Table 11 in Section 6 used an M2F head). All of our evaluation experiments (classification, segmentation, depth, etc.) use a frozen backbone, to show that a single frozen backbone can serve multiple purposes without any task-specific finetuning.
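For readers who want to see what "frozen backbone" means in practice, here is a minimal PyTorch sketch of one training step in which only the head is updated. It is an illustration under stated assumptions, not the actual DINOv3 evaluation code: the tiny `backbone` module, the feature widths, and the per-patch labels are all stand-ins. A real segmentation setup would upsample the logits and compare them against per-pixel labels.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

EMBED_DIM, NUM_CLASSES = 64, 150  # small illustrative width; 150 ADE20K classes

# Stand-in for a pretrained backbone (hypothetical, for illustration only).
backbone = nn.Sequential(nn.Linear(48, EMBED_DIM), nn.GELU())
head = nn.Linear(EMBED_DIM, NUM_CLASSES)

# Freeze the backbone: eval mode and no gradients for its parameters.
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

patches = torch.randn(2, 196, 48)                 # [batch, patches, patch_dim]
labels = torch.randint(0, NUM_CLASSES, (2, 196))  # simplified per-patch labels

backbone_before = [p.clone() for p in backbone.parameters()]
head_before = head.weight.detach().clone()

with torch.no_grad():          # no autograd graph is built through the backbone
    feats = backbone(patches)  # [batch, patches, EMBED_DIM]

logits = head(feats)           # [batch, patches, NUM_CLASSES]
loss = nn.functional.cross_entropy(
    logits.reshape(-1, NUM_CLASSES), labels.reshape(-1)
)

optimizer.zero_grad()
loss.backward()
optimizer.step()

backbone_unchanged = all(
    torch.equal(a, b) for a, b in zip(backbone_before, backbone.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
print("backbone unchanged:", backbone_unchanged, "| head changed:", head_changed)
```

Because the backbone's features never change, they could also be computed once and cached, which keeps head training cheap even for large backbones.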

Oct 15 '25 13:10 se-yi