Distill DINOv3 into any Model Architecture
Hi, we implemented distillation of DINOv3 into any model architecture. Here’s how you can distill from DINOv3 into a ResNet:
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        method="distillation",
        method_args={
            "teacher": "dinov3/vitl16",
            # Replace with your own URL
            "teacher_url": "https://dinov3.llamameta.net/dinov3_vits16/dinov3_vits16_pretrain_lvd1689m-08c60483.pth<SOME-KEY>",
        },
    )
Check https://github.com/lightly-ai/lightly-train for the code.
Bonus: if you’re into semantic segmentation, you can also use DINOv3 weights as a backbone for EoMT.
Hi,
Can you tell me what the distillation head looks like? For example, if the teacher model outputs a 4096-dimensional embedding but the student is a ResNet-18 whose output dimension is only 512, what would the connector look like?
Could it be something like this, or is there a better structure:

nn.Sequential(
    nn.Linear(512, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)
Hi @CoinCheung,
We inherit the distillation head from DINO. You can have a look at the architectural details here: https://github.com/lightly-ai/lightly-train/blob/main/src/lightly_train/_methods/distillationv2/distillationv2.py#L113
In any case, we aim to hide the complexity of the distillation algorithm from the user, so you can simply use the method by specifying method="distillation" in lightly_train.train().
The default is to use a single linear layer that projects the student features to the same dimension as the teacher (nn.Linear(512, 4096))
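For illustration, here is a minimal sketch of such a connector for the dimensions from your example; this is not the exact lightly-train code, just the idea of a single linear projection:

import torch
from torch import nn

# Minimal sketch of the default connector: one linear projection from the
# student dimension (e.g. 512 for ResNet-18) to the teacher dimension.
# The 512 and 4096 here are just the numbers from the example above.
student_dim, teacher_dim = 512, 4096
projection_head = nn.Linear(student_dim, teacher_dim)

student_features = torch.randn(8, student_dim)  # e.g. pooled ResNet-18 features
projected = projection_head(student_features)   # shape: (8, 4096)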
@yutong-xiang-97 that's very impressive!
I noticed that DINOv3's ViT-B/L/H checkpoints are missing the DINO and iBOT head parameters (cf. https://github.com/facebookresearch/dinov3/issues/84). In this case, how do you handle these missing keys during knowledge distillation, assuming the teacher network is frozen?
Hi @goutamyg, thank you for your kind words!
The DINO and iBOT heads are instantiated from scratch in DINOv3/DINOv2 as far as I know. The DINO and iBOT head weights are not released in either repo.
In our distillation implementation, we distill from the backbone directly; the DINO and iBOT heads are not used. The distillation head is instead used for matching the dimensions of the teacher and the student.
@yutong-xiang-97 Both the papers mention that the teacher network is frozen during distillation. So, it is not clear whether it includes the DINO and iBOT heads or not :)
I checked the implementation for model distillation in your codebase. If I understand correctly, the teacher and student backbone outputs are mapped to a higher dimension by DistillationV2Head and a KL-divergence loss is applied on the resulting higher-dimensional embeddings. In this case, what are the inputs to the student and teacher networks? Is it just local and global crops, respectively? Can you share whether there is a compact encoder (e.g., ViT-Small or ViT-Tiny) that you have distilled with this approach? I am curious to evaluate its performance. Thank you!
@goutamyg In our distillation methods the teacher and student receive exactly the same input (this is inspired by this paper).
There is only a projection head on top of the student model to match the student embedding space to that of the teacher.
In DistillationV2 the loss function is a plain MSE loss between the teacher and student embeddings (after projecting the student features to the teacher space).
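As a rough sketch of that MSE objective (not the exact DistillationV2 code; the student, projection head, and teacher below are placeholder modules):

import torch
import torch.nn.functional as F

# Sketch of an MSE-based distillation step as described above.
# `student`, `projection_head`, and `teacher` are placeholder modules.
def distillation_step(student, projection_head, teacher, images):
    with torch.no_grad():                            # the teacher stays frozen
        teacher_emb = teacher(images)                # (B, teacher_dim)
    student_emb = projection_head(student(images))   # project to the teacher space
    return F.mse_loss(student_emb, teacher_emb)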
In DistillationV1 both the teacher and student embeddings are projected on a queue of teacher embeddings (from previous batches). The queue serves as a pseudo-classification layer. After applying a softmax to the resulting logits we obtain two distributions and use the KL-divergence loss to enforce consistency between the two. This is again inspired by this paper.
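And a rough sketch of the queue-based variant; the queue size, temperature, and shapes are illustrative assumptions, not the exact DistillationV1 code:

import torch.nn.functional as F

# Sketch of the DistillationV1 idea: project both embeddings onto a queue of
# past teacher embeddings (acting as a pseudo-classification layer), then match
# the resulting distributions with a KL-divergence loss.
def distillation_v1_loss(student_emb, teacher_emb, queue, temperature=0.1):
    # student_emb, teacher_emb: (B, D); queue: (K, D) teacher embeddings from previous batches
    student_logits = student_emb @ queue.t() / temperature   # (B, K)
    teacher_logits = teacher_emb @ queue.t() / temperature   # (B, K)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")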
So far users have typically trained their custom models on their own data, but we could/should consider shipping small encoders if there is demand for that.
@stegmuel Thank you for your detailed response! That answered all of my questions.
I am sure there is demand for compact encoders (~5 million params or less), since the smallest encoders in DINOv2 and v3 have more than 20M params. Looking forward to the release of your pretrained encoders!
Hi @yutong-xiang-97, I noticed that you are distilling from the CLS token embedding output by DINOv3. I was wondering if we can choose to distill using the entire feature map instead of just the CLS token embedding. I was thinking of distilling my CNN backbone's (C, H, W) features using the DINOv3 feature map (embedding_dim, H, W) as the teacher.
The entire feature map is used by default (check distillationv2.py, v2 is the default). However, in our tests with YOLO models we have seen better performance when distilling from the CLS tokens than from the feature maps. This is likely because the features from DINOv3 are relatively uniform over the whole feature map, which allows the student model to predict the mean feature quite easily. Features are uniform in DINOv3 because it uses register tokens. We have observed the same behavior when comparing DINOv2 with and without register tokens: with register tokens, distillation from the CLS token works well; without register tokens, distillation from feature maps works better than from the CLS token. A PR for distilling DINOv3 from CLS tokens will be up next week.
I guess a combination of CLS and feature maps might work well. This is also how DINOv3 multi-distillation for the ConvNext model works. Will be interesting to see how well it works with models like YOLO. When we tried it with DINOv2 we got worse results with CLS and feature maps than just with feature maps. Will keep you posted :)
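A hedged sketch of what such a combined objective could look like; the pooling, the 1x1 projections, and the loss weighting are assumptions for illustration, not the lightly-train implementation:

import torch
import torch.nn.functional as F
from torch import nn

# Illustrative combined distillation loss: match the student's pooled features
# to the teacher CLS token and its spatial features to the teacher feature map.
# Channel sizes and projections are placeholder choices.
student_channels, teacher_dim = 512, 1024
cls_proj = nn.Linear(student_channels, teacher_dim)
map_proj = nn.Conv2d(student_channels, teacher_dim, kernel_size=1)

def combined_loss(student_map, teacher_map, teacher_cls, cls_weight=0.5):
    # student_map: (B, C, H, W), teacher_map: (B, D, h, w), teacher_cls: (B, D)
    pooled = student_map.mean(dim=(2, 3))                 # global average pool
    cls_loss = F.mse_loss(cls_proj(pooled), teacher_cls)
    s_map = F.interpolate(map_proj(student_map), size=teacher_map.shape[-2:],
                          mode="bilinear", align_corners=False)
    map_loss = F.mse_loss(s_map, teacher_map)
    return cls_weight * cls_loss + (1 - cls_weight) * map_loss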
@guarin in the distillationv2.py file, in the get_teacher method, teacher_embedding_model.eval() is being called, which means when this part is called, it only returns the embedding of the CLS token, not the entire feature map. Or am I missing something here?
Also, I am curious, which version of YOLO model did you distill with DINOv3?
You are right that forward returns only CLS tokens when the model is in eval mode. But we call get_intermediate_layers which always returns feature maps.
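For illustration, this is roughly how feature maps (and optionally the CLS token) can be pulled from a DINOv2-style backbone via get_intermediate_layers; shown here with a DINOv2 ViT-S/14 from torch hub, and the exact keyword arguments may differ slightly in your installed version:

import torch

# Illustrative only: get_intermediate_layers returns patch feature maps (and
# optionally the CLS token) regardless of eval mode.
teacher = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

images = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    (feature_map, cls_token), = teacher.get_intermediate_layers(
        images,
        n=1,                      # last block only
        reshape=True,             # feature map as (B, C, H/14, W/14)
        return_class_token=True,  # also return the CLS token as (B, C)
    )
print(feature_map.shape, cls_token.shape)  # e.g. (2, 384, 16, 16) and (2, 384)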
Also, I am curious, which version of YOLO model did you distill with DINOv3?
YOLO11
Got it, thanks for sharing that. Curious to know, were you able to get a stronger/comparable backbone with this kind of distillation compared to an ImageNet-pretrained backbone?
Yes, it is much stronger than ImageNet pretraining.
@guarin, not directly relevant to our previous conversation, but have you experimented with distilling DINO models into YOLO models that use an FPN-based backbone? How would you go about distilling knowledge into these, when you no longer have a single feature map but rather several maps at different scales? One idea I had was to use DINO feature maps from different blocks for this.
Yes you can definitely do this. We actually use features from multiple DINO blocks for distillation.
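A rough sketch of that multi-level idea, assuming a DINOv2-style get_intermediate_layers API; the block indices, channel sizes, and per-level 1x1 projections are placeholder assumptions:

import torch
import torch.nn.functional as F
from torch import nn

# Illustrative sketch: distill several FPN levels of a student against feature
# maps taken from different teacher blocks.
teacher_blocks = [3, 6, 9, 11]                  # which ViT blocks to tap
student_channels = [64, 128, 256, 512]          # example FPN channel widths
teacher_dim = 384                               # e.g. ViT-S embedding dimension
projections = nn.ModuleList([nn.Conv2d(c, teacher_dim, 1) for c in student_channels])

def fpn_distillation_loss(student_fpn_maps, teacher, images):
    with torch.no_grad():
        teacher_maps = teacher.get_intermediate_layers(
            images, n=teacher_blocks, reshape=True
        )
    loss = 0.0
    for proj, s_map, t_map in zip(projections, student_fpn_maps, teacher_maps):
        s_proj = proj(s_map)                                    # match channel dims
        s_proj = F.interpolate(s_proj, size=t_map.shape[-2:],   # match spatial size
                               mode="bilinear", align_corners=False)
        loss = loss + F.mse_loss(s_proj, t_map)
    return loss / len(teacher_blocks)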
It would be really awesome to be able to modify DINOv3 to use multi-channel images and do knowledge distillation for edge-device deployment.
Hi @rudro12356, thank you for your reply! We already support multi-channel input: https://docs.lightly.ai/train/stable/data/multi_channel.html.
For KD & edge deployment, we currently support distilling DINOv3 into YOLO models, which you can deploy on edge with e.g. Ultralytics. We're working on more edge support, so stay tuned!
Hi @yutong-xiang-97,
Apologies for any confusion in my previous message. To clarify, I would like to propose a feature: modifying the DINO models to natively accept multi-channel input.
I understand the current limitation, as stated in the documentation: "Multi-channel input is not supported for direct distillation because the DINOv2/v3 teacher models expect 3-channel input. However, you could load n-channel images and then reduce them to 3-channels with the ChannelDrop augmentation." I recognize that the models were originally trained on RGB images, but enabling native multi-channel support would be a significant enhancement for a wider range of applications.
Best, Rudro, R