dinov2
Two questions about semantic segmentation with DINOv2
Hi, thanks for your great work! I have two questions.

First, I noticed in your paper that DINOv2 pre-trained at high resolution performs better as the resolution increases, for instance from 512 to 640. Does this mean the model can adapt to different input resolutions for semantic segmentation? If so, is there any insight behind this behavior? To my knowledge, this is easy for CNNs but not trivial for ViTs.

Second, how did you make the input size a multiple of 14? I saw that you define the CenterPadding class in the notebooks. So you pad the input rather than resizing it? Does this choice have any impact on performance?
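To make the padding-versus-resizing question concrete, here is a minimal sketch of the arithmetic involved. It is not the notebook's CenterPadding implementation; the helper names `center_pad_amounts` and `resized_size` are hypothetical, and the only assumption taken from the thread is the patch size of 14.

```python
import math
from typing import Tuple

PATCH_SIZE = 14  # patch size mentioned in the question


def center_pad_amounts(size: int, multiple: int = PATCH_SIZE) -> Tuple[int, int]:
    """Return (before, after) padding that rounds `size` up to a multiple."""
    target = math.ceil(size / multiple) * multiple
    total = target - size
    before = total // 2
    return before, total - before


def resized_size(size: int, multiple: int = PATCH_SIZE) -> int:
    """Alternative to padding: the nearest multiple to resize the side to."""
    return max(multiple, round(size / multiple) * multiple)


for side in (512, 640):
    before, after = center_pad_amounts(side)
    padded = side + before + after
    # Padding keeps the original pixels untouched and adds a thin border;
    # resizing would instead rescale every pixel slightly.
    print(side, "pad", (before, after), "->", padded,
          "patches per side:", padded // PATCH_SIZE)
```

With padding, the image content keeps its original scale and only a few border pixels are added, whereas resizing changes the scale of every object slightly. Which of the two is better for segmentation accuracy is exactly what the question asks, and the sketch does not answer it.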
The resize_pos_embed function is not used here. What should be done when the pre-training resolution differs from the downstream input resolution?
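A common approach for ViTs in this situation is to interpolate the learned patch position embeddings from the pre-training grid to the grid implied by the new input size (for example, from 37×37 patches for a 518-pixel side to 46×46 for a 644-pixel side), while leaving the class-token embedding unchanged. The sketch below illustrates the idea in plain Python with bilinear, corner-aligned interpolation. It is an assumption-laden illustration, not the code from this repository: `resize_pos_grid` is a hypothetical helper, and real implementations typically use a tensor library and often bicubic interpolation.

```python
from typing import List

Grid = List[List[List[float]]]  # [rows][cols][embedding_dim]


def resize_pos_grid(grid: Grid, new_h: int, new_w: int) -> Grid:
    """Bilinearly resample a grid of position embeddings (corner-aligned)."""
    old_h, old_w = len(grid), len(grid[0])
    dim = len(grid[0][0])

    def source_coord(i: int, new_n: int, old_n: int) -> float:
        # Map an output index to a fractional input index; corners stay fixed.
        return 0.0 if new_n == 1 else i * (old_n - 1) / (new_n - 1)

    out: Grid = []
    for i in range(new_h):
        y = source_coord(i, new_h, old_h)
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            x = source_coord(j, new_w, old_w)
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            # Blend the four neighbouring embeddings, one dimension at a time.
            row.append([
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ])
        out.append(row)
    return out


# Toy example: a 2x2 grid of 1-dimensional embeddings resized to 3x3.
toy = [[[0.0], [1.0]], [[2.0], [3.0]]]
resized = resize_pos_grid(toy, 3, 3)
print(resized[1][1])  # embedding at the centre of the new grid
```

Only the patch-grid embeddings are resampled here; a class token (or any other non-spatial token) would be split off beforehand and concatenated back afterwards.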