Some Questions about model distillation
Thanks for your work! I have some questions about model distillation. From the paper: "we leverage the same training loop with a few exceptions: we use a larger model as a frozen teacher, keep a spare EMA of the student that we use as our final model, remove the masking and stochastic depth, and, apply the iBOT loss on the two global crops."
- I can only get the ViT-g backbone pretrained model. Does "frozen teacher" include the "dino head" and the "ibot head"?
- What does "keep a spare EMA of the student" mean? Are the student model parameters updated with EMA? The student and the teacher are not the same model.
- If you look at the default training config for ViT-g, a separate head is used for iBOT (two heads: a DINO head and an iBOT head). The frozen teacher should include both of these frozen heads, since this is distillation and you want the joint embedding not to change.
- Keeping a spare EMA of the student essentially means creating a copy of the student and updating it by exponential moving average at a certain frequency. It can be updated in the same way the teacher is updated in the training code, dinov2/train/ssl_meta_arch.py (see the sketch after the note below).
Note: The distillation code was not included in this repository. You cannot use ssl_meta_arch.py to do distillation as is. You would need to modify it to include the student EMA and to load different models for the teacher and the student. You would also need to create a method to update the EMA, similar to the way the frozen teacher is updated (this deals with collecting all the gradients in FSDP). I have created a fork of this repository with some distillation code, which can be found here.
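To make the two points above concrete, here is a minimal sketch of the setup, assuming plain (non-FSDP) modules. The `build_*` factories and module names are placeholders, not code from this repository or from the fork:

```python
import copy

import torch
import torch.nn as nn


def build_distillation_modules(build_teacher_backbone, build_student_backbone,
                               build_dino_head, build_ibot_head):
    """Assemble the frozen teacher, the student, and the spare EMA copy.

    The build_* arguments are hypothetical factories standing in for whatever
    constructs the backbones and heads in your setup (e.g. the components
    configured in the ViT-g training config).
    """
    # Frozen teacher: backbone + DINO head + iBOT head, all frozen so the
    # joint embedding does not change during distillation.
    teacher = nn.ModuleDict({
        "backbone": build_teacher_backbone(),
        "dino_head": build_dino_head(),
        "ibot_head": build_ibot_head(),
    })
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)

    # Student: the smaller model being distilled, with its own heads.
    student = nn.ModuleDict({
        "backbone": build_student_backbone(),
        "dino_head": build_dino_head(),
        "ibot_head": build_ibot_head(),
    })

    # "Spare EMA of the student": a frozen copy that is updated by exponential
    # moving average and kept as the final model.
    student_ema = copy.deepcopy(student)
    for p in student_ema.parameters():
        p.requires_grad_(False)

    return teacher, student, student_ema


@torch.no_grad()
def update_student_ema(student_ema, student, momentum=0.999):
    """Blend the EMA copy toward the current student weights.

    Call this after each optimizer step (or at some fixed frequency). Under
    FSDP the parameters are sharded, so the real training code needs the
    gather/consolidation handling mentioned in the note above; this sketch
    assumes plain, unsharded modules.
    """
    for ema_p, p in zip(student_ema.parameters(), student.parameters()):
        ema_p.mul_(momentum).add_(p, alpha=1.0 - momentum)
```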
Thanks for your great work! Did you reproduce their distillation result?
@usryokousha Thanks for your great work! Have you reproduced their distillation result?
@usryokousha your code has a small error when using copy.deepcopy on an nn.ModuleDict of PyTorch models:
```
(<class 'RuntimeError'>, RuntimeError('Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment. If you were attempting to deepcopy a module, this may be because of a torch.nn.utils.weight_norm usage, see https://github.com/pytorch/pytorch/pull/103001'), <traceback object at 0x7f11a4044480>)
```

```python
self.student = nn.ModuleDict(student_model_dict)
self.teacher = nn.ModuleDict(teacher_model_dict)
self.student_shadow = copy.deepcopy(self.student)  # This line causes the error
```
How can we fix this?
As for the bug RuntimeError('Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment.'):
I referred to this link: https://github.com/pytorch/pytorch/issues/102981
In dino_head.py, I changed `from torch.nn.utils import weight_norm` to `from torch.nn.utils.parametrizations import weight_norm`.
This requires torch >= 2.1.0. In addition, I commented out the line `self.last_layer.weight_g.data.fill_(1)`.
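For reference, here is a minimal sketch of the difference between the two APIs (assumes torch >= 2.1; the layer dimensions are illustrative, and the attribute path comes from the parametrize machinery, where `original0` holds the weight-norm magnitude g):

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import weight_norm  # new parametrized API

# With the old torch.nn.utils.weight_norm, the wrapped layer exposed weight_g /
# weight_v attributes, which is what dino_head.py touches via
#   self.last_layer.weight_g.data.fill_(1)
last_layer = weight_norm(nn.Linear(256, 65536, bias=False))

# The parametrized version stores magnitude and direction under
# last_layer.parametrizations.weight instead, so weight_g no longer exists.
# Either comment the fill_(1) line out, or set the magnitude through the new path:
with torch.no_grad():
    last_layer.parametrizations.weight.original0.fill_(1)
```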