
On the question of knowledge distillation

Open yuyu19970716 opened this issue 1 year ago • 8 comments

Hello, I would like to ask: where in this code is the training method that uses a distillation algorithm? From what I can see, the teacher and student networks are identical, so there is no distillation, right?

yuyu19970716 avatar May 21 '24 07:05 yuyu19970716

I have the same doubt. The open-source code seems to use self-distillation. I also hope the authors can answer; thank you very much.

FuHTong avatar May 30 '24 10:05 FuHTong

You can take a look at this link; it appears to be an implementation of the distillation algorithm, but I haven't trained with it yet. If you train it successfully, please let me know, thank you! https://github.com/usryokousha/dinov2/blob/main/README.md

yuyu19970716 avatar Jun 05 '24 02:06 yuyu19970716

I'm also working on training the model on a custom dataset. There is a nice tutorial on how the DINO model is trained with a teacher/student network, along with demo code to practice with. Thanks.
https://github.com/clint-kristopher-morris/DINO_concise/blob/main/notebooks_/Concise_DINO-Demo.ipynb

khpanb avatar Jun 20 '24 06:06 khpanb

@yuyu19970716 Thank you very much! At present I have done some work on downstream tasks and found that the pre-training stage can have a large impact on them; for example, accuracy on things that are not in the dataset may not be as good as with a CNN. I will take this problem seriously going forward, and I will also look into whether it is possible to freeze the teacher network and train only the small models.

FuHTong avatar Jun 20 '24 06:06 FuHTong

@khpanb Thank you for your open-source project; I have starred it and will study it carefully.

FuHTong avatar Jun 20 '24 07:06 FuHTong

Hello, I would like to ask: where in this code is the training method that uses a distillation algorithm? From what I can see, the teacher and student networks are identical, so there is no distillation, right?

To make matters more confusing, the main training loop is itself a form of knowledge distillation, but with identical teacher and student architectures. Afterwards, there is a separate knowledge distillation step used for model compression; that code is not public, only the model weights. For this distillation-for-compression step, they use a large model as the teacher and a small model as the student, plus a few other differences (see https://github.com/facebookresearch/dinov2/issues/176#issuecomment-1702666128).
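If it helps make that concrete, here is a rough sketch of what a distillation-for-compression step can look like in general. This is not their unreleased recipe (which goes through the DINO/iBOT heads); it is plain feature matching against a frozen large teacher, and the hub model names, dimensions, and hyperparameters are only illustrative:

```python
# Hedged sketch: distillation-for-compression with mismatched model sizes.
# NOT the authors' pipeline (that code is unreleased); this just matches the
# small student's features to a frozen large teacher's features.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14")  # large, frozen
student = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")  # small, trainable
# (In a real compression run the student would typically start from scratch
# or be a different architecture entirely.)
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Project the student's 384-d CLS features up to the teacher's 1536-d width.
proj = nn.Linear(384, 1536)
opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-4)

def distill_step(images):
    with torch.no_grad():
        t_feat = teacher(images)            # (B, 1536) CLS features, no gradient
    s_feat = proj(student(images))          # (B, 1536) after projection
    # Simple cosine loss; the paper's recipe uses DINO/iBOT head losses instead.
    loss = 1.0 - F.cosine_similarity(s_feat, t_feat, dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy batch; image sides must be multiples of the 14-pixel patch size.
print(distill_step(torch.randn(2, 3, 224, 224)))
```

The key point is only that the teacher is a larger, frozen network and that gradients flow solely through the small student.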

crypdick avatar Jul 23 '24 12:07 crypdick

Hello, I would like to ask: where in this code is the training method that uses a distillation algorithm? From what I can see, the teacher and student networks are identical, so there is no distillation, right?

To make matters more confusing, the main training loop is itself a form of knowledge distillation, but with identical teacher and student architectures. Afterwards, there is a separate knowledge distillation step used for model compression; that code is not public, only the model weights. For this distillation-for-compression step, they use a large model as the teacher and a small model as the student, plus a few other differences (see https://github.com/facebookresearch/dinov2/issues/176#issuecomment-1702666128).

I think this is not completely accurate. They said in another issue that they don't release all of the weights needed to distill from the ViT-G, specifically that the head weights (presumably the task heads) are NOT released. So even if the code supported it, which it doesn't, you'd need to recreate the ViT-G just to get the head weights needed for downstream teacher-student distillation. https://github.com/facebookresearch/dinov2/issues/176#issuecomment-1702666128

With these head weights not released and the code only supporting self-distillation (not teacher-student distillation), it's not remotely practical to reproduce any of the smaller models in the way they were published and at their highest levels of performance. Self-distillation is the only option for creating those smaller models, and their paper shows it performs worse.
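To be clear about the terminology: by "self-distillation" I mean what the released training loop does, where the teacher is just an exponential-moving-average copy of the student rather than a separate, larger pretrained model. Roughly like this (a sketch only; the momentum value and the Linear stand-in are illustrative):

```python
# Hedged sketch of the EMA-teacher update used in self-distillation.
import copy
import torch

def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, m: float = 0.994):
    # teacher_param <- m * teacher_param + (1 - m) * student_param
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(m).add_(s_p.detach(), alpha=1 - m)

# Teacher and student share one architecture; that is what makes it *self*-distillation.
student = torch.nn.Linear(8, 8)      # stand-in for the ViT backbone + heads
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

# ...after each optimizer step on the student:
update_teacher(student, teacher)
```

Distilling into a smaller (or different) student would instead need a frozen, already-trained teacher and no EMA update, which is the part the released code doesn't do.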

One stinky aspect of all these missing components (missing weights, no code to reproduce the smaller models in the way they were published, a huge private dataset) is that the community can't really distill into other architectures, which could have led to even broader adoption of DINOv2 in even more applications. ViTs are not well-suited to every single application.

JBartholomewMN avatar Jul 23 '24 14:07 JBartholomewMN

@JBartholomewMN Help me understand. Their main training script checkpoints both the dino_head and ibot_head for the teacher. Isn't that all you need to compute the dino loss and ibot loss during distillation? What's stopping someone from modifying the train script as the paper described (frozen teacher, removing stochastic depth, etc.)? They even have a reference implementation for EMA weights here.
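Concretely, I'm imagining something like the sketch below: keep the teacher (backbone plus heads restored from their checkpoint) frozen, drop the EMA update, and train a smaller student against the teacher's head outputs with a DINO-style cross-entropy. The module names here are placeholders, and I'm ignoring the iBOT/masking branch, the multi-crop augmentation, and the centering/temperature schedules:

```python
# Hedged sketch of a frozen-teacher distillation step with a DINO-style loss.
# `teacher_backbone`, `teacher_dino_head`, etc. stand in for modules you would
# restore from (or build like) the DINOv2 checkpoints.
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.07):
    # Teacher targets: centered and sharpened softmax, detached from the graph.
    t = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    s = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()

def distill_step(images, teacher_backbone, teacher_dino_head,
                 student_backbone, student_dino_head, center, optimizer):
    with torch.no_grad():                      # frozen teacher, no EMA update
        t_logits = teacher_dino_head(teacher_backbone(images))
    s_logits = student_dino_head(student_backbone(images))
    loss = dino_loss(s_logits, t_logits, center)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Running-mean center over teacher logits, as in DINO.
    center = 0.9 * center + 0.1 * t_logits.mean(dim=0, keepdim=True)
    return loss.item(), center

# Toy usage with linear stand-ins just to show the shapes; real code would use
# the ViT backbones and MLP projection heads.
if __name__ == "__main__":
    K = 4096                                   # prototype count (illustrative)
    flatten = torch.nn.Flatten()
    t_backbone = torch.nn.Sequential(flatten, torch.nn.Linear(3 * 32 * 32, 64))
    s_backbone = torch.nn.Sequential(flatten, torch.nn.Linear(3 * 32 * 32, 64))
    t_head, s_head = torch.nn.Linear(64, K), torch.nn.Linear(64, K)
    opt = torch.optim.AdamW(list(s_backbone.parameters()) + list(s_head.parameters()), lr=1e-4)
    center = torch.zeros(1, K)
    loss, center = distill_step(torch.randn(4, 3, 32, 32),
                                t_backbone, t_head, s_backbone, s_head, center, opt)
    print(loss)
```

Whether that matches their internal recipe closely enough to reach the published numbers is exactly the open question, but it's the kind of modification I have in mind.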

I'm not interested in replicating their paper; I just want to distill my own feature extractor. I feel your pain about not having a full release, and (hint hint) the paper authors probably agree with you. They have no authority on the matter; all the big AI labs go through a release-clearance process.

crypdick avatar Jul 23 '24 18:07 crypdick