dinov2
about model architecture design
Two questions:
- I noticed that in vitl14.yaml you set:

  ```yaml
  dino:
    head_n_prototypes: 131072
    head_bottleneck_dim: 384
  ```

  If I understand correctly, this is just a linear layer. What's the reasoning behind the extreme ratio of input to output channels? Is it operating under the assumption that the prototypes are trying to approximate a one-hot distribution, so it's 131072 bits vs. 384 floating-point numbers?
- Do you have plans to release the prototype heads? I'm trying to adapt the model to the food domain by continuing from your released checkpoints and doing further SSL on Recipe1M+, which has 14M images. Ideally I could resume directly from the teacher/student models. If I understand correctly, the currently released weights are only the teacher backbones, right?
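
For context on the scale of the first question: treating the prototype head as a plain linear map from the 384-dim bottleneck to 131072 prototype logits (an assumption based on the config values above, not a claim about the exact layer structure in the repo), a quick back-of-the-envelope parameter count in plain Python:

```python
# Config values quoted from vitl14.yaml in the question above.
head_bottleneck_dim = 384
head_n_prototypes = 131072

# A bias-free linear layer bottleneck -> prototypes would hold this many weights.
params = head_bottleneck_dim * head_n_prototypes
print(f"{params:,} weights (~{params / 1e6:.0f}M parameters)")
```

So the prototype layer alone would carry roughly 50M parameters, which is part of why releasing (or not releasing) the heads matters for resuming SSL training.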