Are different archs supported for the student and the teacher?
When building the models in this code, the same `args.arch` parameter is used for both the student and the teacher.
As described in the paper, smaller models are distilled from the largest model (a frozen teacher). How can this be achieved with the above code?
The code does not support distillation at the moment; you would need to hack a little bit inside, so that you:
- initialize and freeze a teacher along with its heads (which we don't provide)
- train a student with this teacher
- apply the masked-image-modeling loss to all the tokens (not only the subset that is masked)
- track an EMA of the student for evaluation
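The steps above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the repo's actual implementation: `Backbone` is a stand-in for the real ViT plus heads, and all names and hyperparameters here are illustrative.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the real ViT backbone + heads (hypothetical).
class Backbone(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, n_tokens, dim) -> per-token features
        return self.proj(x)

torch.manual_seed(0)
teacher = Backbone()
student = Backbone()

# 1) Initialize and freeze the teacher (and its heads): eval mode, no grads.
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# 4) Keep an EMA copy of the student for evaluation.
student_ema = copy.deepcopy(student)
for p in student_ema.parameters():
    p.requires_grad_(False)

opt = torch.optim.SGD(student.parameters(), lr=0.1)
x = torch.randn(4, 8, 16)  # dummy batch of token sequences

losses = []
for _ in range(5):
    with torch.no_grad():
        t_out = teacher(x)
    s_out = student(x)
    # 2)+3) Train the student against the frozen teacher, with the
    # distillation loss applied to ALL tokens, not only a masked subset.
    loss = F.mse_loss(s_out, t_out)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
    # EMA update of the evaluation copy.
    with torch.no_grad():
        for p_ema, p in zip(student_ema.parameters(), student.parameters()):
            p_ema.mul_(0.999).add_(p, alpha=0.001)
```

The real pipeline would use the actual DINO/iBOT heads and their cross-entropy-style objectives rather than MSE; the structure (frozen teacher, all-token loss, EMA copy) is what the sketch is meant to show.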
Thanks for the reply! I am also interested in this. So there are no plans to add this missing code?
Thanks! @qasfb I'm wondering: do we need the DINO loss in addition to the masked-image-modeling loss during distillation?
Hi @qasfb
* apply the masked-image-modeling loss to all the tokens (not only the subset that is masked)
Could you explain why the MIM loss needs to be applied to all the tokens in this case?