
Clarifications on High-Resolution Adaptation

Open JihwanEom opened this issue 1 year ago • 4 comments

Hello,

I'd like to ask for clarification on the high-resolution adaptation described in the paper. According to Section 4 and Appendix B.2, the model was trained at a higher resolution (518 instead of 224) for 10k iterations at the end of pretraining. However, I couldn't find the related code in this repository.

  • Section 4 states:

    "Adapting the resolution (Touvron et al., 2019). Increasing image resolution is key to pixel-level downstream tasks such as segmentation or detection, where small objects disappear at low resolutions. However, training at high resolution is time and memory demanding, and instead, we increase the resolution of images to 518 × 518 during a short period at the end of pretraining."

  • Appendix B.2 mentions:

    "We initialise the model with the pretrained weights then train it for 10k iterations with the same procedure as the original pretraining. All the schedules are kept the same as in the original training, but compressed to fit in 10k iterations. All the hyperparameters are kept the same as in the first pretraining, except the base learning rate which is reduced."

  1. Code Availability: Is the high-resolution adaptation code not included in this repository's release?
  2. Details on "compressed to fit": Could you share more details on what "compressed to fit in 10k iterations" means in practice? (The answer may also cover the third question.)
  3. Batch Size & Learning Rate: It would be very helpful if you could provide the batch size and learning rate used during this high-resolution adaptation phase.

Thank you in advance!

JihwanEom avatar Sep 09 '23 15:09 JihwanEom

I think you can just use the existing code as is. Note that the crop size for the ViT-G variant is 518, which means the position embedding is interpolated during training. You can simply change the crop size in the configuration from 224 to 518 for the last 10k iterations. The configs included in this repository are for ImageNet-1k and ImageNet-22k, not for the larger internal dataset used in the paper. The schedulers automatically compress the exponential decays/increases as a function of the total number of steps.
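For concreteness, here is a rough sketch of the kind of config override I mean. The key names follow ssl_default_config.yaml as I remember it, so please verify them against your checkout; the learning rate, iteration count, and batch size below are illustrative placeholders, since the paper only says the base LR is reduced for this phase.

```yaml
# Sketch of high-resolution adaptation overrides (values are illustrative,
# not the official hyperparameters from the paper):
crops:
  global_crops_size: 518        # was 224; must stay a multiple of the patch size (14)
optim:
  base_lr: 1.0e-04              # reduced w.r.t. the first pretraining, per Appendix B.2
  epochs: 8                     # together with OFFICIAL_EPOCH_LENGTH, roughly 10k iterations
train:
  OFFICIAL_EPOCH_LENGTH: 1250   # 8 x 1250 = 10k steps; schedules are compressed to this length
  batch_size_per_gpu: 8         # 518x518 crops need much more memory, so scale this down
```

Because all the warmup/decay schedules are defined over the total number of steps, shrinking the run to ~10k iterations is what "compresses" them, as the appendix describes.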

usryokousha avatar Sep 11 '23 03:09 usryokousha

Thank you @usryokousha for the clarification. Could you please point to where the interpolation of the position embedding happens? I cannot use the 224 model with a 518 input, or vice versa.

I get the same error as #316 when I try to do that.

zshn25 avatar Jan 30 '24 09:01 zshn25

I am quite sure the distillation-based models are all trained with a base context length of (224 x 224) + 1. Only the ViT-Giant model is trained with (518 x 518) + 1. You should encounter a weight mismatch when loading the other distillation-based models with a high-resolution input (assuming you are referring to fine-tuning); for pre-training you shouldn't have a problem. The interpolation of the position embedding happens here: https://github.com/facebookresearch/dinov2/blob/2302b6bf46953431b969155307b9bed152754069/dinov2/models/vision_transformer.py#L179
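If it helps, a quick way to see the on-the-fly interpolation in action is to run the same released backbone at two resolutions through the torch.hub entry points. This is just a sanity-check sketch on my side; it assumes the dinov2_vitg14 hub name and that the forward pass goes through interpolate_pos_encoding, as in the current repo.

```python
import torch

# Load the released ViT-g/14 backbone via torch.hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14")
model.eval()

# The position embedding is stored for one base resolution, but
# interpolate_pos_encoding resizes it to match the input grid, so the same
# weights accept any input whose side lengths are multiples of the patch
# size (14).
with torch.no_grad():
    feats_224 = model(torch.randn(1, 3, 224, 224))
    feats_518 = model(torch.randn(1, 3, 518, 518))

print(feats_224.shape)  # torch.Size([1, 1536])
print(feats_518.shape)  # torch.Size([1, 1536])
```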


usryokousha avatar Jan 31 '24 14:01 usryokousha

+1 I would also like clarification on this if possible :)

anadodik avatar Mar 04 '24 17:03 anadodik