Are different archs supported for the student and the teacher?
When building the models in this code, the same `args.arch` parameter is used for both the student and the teacher.
As described in the paper, smaller models are distilled from the largest model (a frozen teacher). How can this be achieved with the above code?
The code does not support distillation at the moment; you would need to hack a little bit inside, so that you:
- initialize and freeze a teacher along with its heads (which we don't provide)
- train a student with this teacher
- apply the masked-image-modeling loss to all the tokens (not only the subset that is masked)
- track an EMA of the student for evaluation
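The steps above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the repo's actual implementation: `Backbone` is a stand-in for the real ViT plus heads, and all names and hyperparameters here are illustrative.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for the real ViT backbone + heads (hypothetical).
class Backbone(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, n_tokens, dim) -> per-token features
        return self.proj(x)

torch.manual_seed(0)
teacher = Backbone()
student = Backbone()

# 1) Initialize and freeze the teacher (and its heads): eval mode, no grads.
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# 4) Keep an EMA copy of the student for evaluation.
student_ema = copy.deepcopy(student)
for p in student_ema.parameters():
    p.requires_grad_(False)

opt = torch.optim.SGD(student.parameters(), lr=0.1)
x = torch.randn(4, 8, 16)  # dummy batch of token sequences

losses = []
for _ in range(5):
    with torch.no_grad():
        t_out = teacher(x)
    s_out = student(x)
    # 2)+3) Train the student against the frozen teacher, with the
    # distillation loss applied to ALL tokens, not only a masked subset.
    loss = F.mse_loss(s_out, t_out)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
    # EMA update of the evaluation copy.
    with torch.no_grad():
        for p_ema, p in zip(student_ema.parameters(), student.parameters()):
            p_ema.mul_(0.999).add_(p, alpha=0.001)
```

The real pipeline would use the actual DINO/iBOT heads and their cross-entropy-style objectives rather than MSE; the structure (frozen teacher, all-token loss, EMA copy) is what the sketch is meant to show.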
Thanks for the reply! I am also interested in this. So there are no plans to add this missing code?
Thanks! @qasfb I'm wondering: do we need the DINO loss in addition to the masked-image-modeling loss during distillation?
Hi @qasfb
* apply the masked-image-modeling loss to all the tokens (not only the subset that is masked)
Could you explain why the MIM loss needs to be applied to all the tokens in this case?