
DINOv2 patch size is written as 16 instead of 14?

Open tcourat opened this issue 1 year ago • 8 comments

Hi,

First of all, thanks for your work.

I noticed that, according to the code, the patch_size for DINOv2 is expected to be 16: https://github.com/mhamilton723/FeatUp/blob/c04e4c19945ce3e98a5488be948c7cc1fdcdacc6/featup/featurizers/util.py#L27

However, to my knowledge, unlike DINO and conventional ViTs, the pretrained DINOv2 models use 14x14 patches, not 16x16 patches. This is explicitly stated here: https://github.com/facebookresearch/dinov2/blob/main/MODEL_CARD.md and in the DINOv2 paper as well. Could you clarify this?

To me this looks like a leftover from earlier work on the DINOv2Featurizer class, since some lines are commented out; in the end the patch size (and every other input parameter) is unused, and the small DINOv2 is loaded directly from torch hub (where it uses 14x14 patches): https://github.com/mhamilton723/FeatUp/blob/c04e4c19945ce3e98a5488be948c7cc1fdcdacc6/featup/featurizers/DINOv2.py#L433
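
You can check this quickly (a small sketch; I'm assuming the hub model exposes patch_size the way the DINOv2 repo defines it):

```python
import torch

# Load the small pretrained DINOv2 from torch hub, as FeatUp does
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')

print(model.patch_size)  # 14, not 16
# A 224x224 input therefore yields a 16x16 token grid (224 / 14 = 16)
```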

Also, it would be very useful to have upsamplers for the larger DINOv2 models and not only the small one (having base, large, and giant variants would be nice :+1: ).

tcourat avatar Mar 21 '24 14:03 tcourat

Fixed this typo, thank you, and I will try to make some larger versions in the next few weeks!

mhamilton723 avatar Mar 26 '24 21:03 mhamilton723

> Fixed this typo, thank you, and I will try to make some larger versions in the next few weeks!

Could you please provide an estimated release date for the larger version of DINOv2? I'm eager to utilize the more advanced model.

avaxiao avatar Mar 29 '24 19:03 avaxiao

Do you mean using bigger backbones like DINOv2 ViT-L/G? I am also eager to use those models if that's the case.

vcadillog avatar Mar 29 '24 19:03 vcadillog

> Do you mean using bigger backbones like DINOv2 ViT-L/G? I am also eager to use those models if that's the case.

Yes, I am interested in DINOv2 ViT-L.

avaxiao avatar Mar 29 '24 19:03 avaxiao

Hello, I'm also using the DINOv2 model. I've noticed that the resolution of hr_feat after upsampling is still 16 times that of lr_feat, not 14 times. I suspect the error comes from the forward pass of the JBUStack class in ./featup/upsamplers.py, where self.upsample is called four times, resulting in a 2^4 = 16x upsampling of the low-resolution features. Perhaps it could be solved by building a self.upsample_new with a different scale factor to achieve a 14x magnification, and then training new pretrained weights? I really admire your work and look forward to adaptations for the other, larger versions of DINOv2.
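
To illustrate the factor (a toy stand-in only, with F.interpolate replacing the actual joint bilateral upsampler):

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 384, 16, 16)  # lr_feat from a 224x224 image with ViT-S/14

# Stand-in for the JBU stack: four successive 2x upsamples
for _ in range(4):
    feat = F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=False)

print(feat.shape[-2:])  # torch.Size([256, 256]) -> 16x, overshooting the 14x target (224)
```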

JustinXu0 avatar Apr 03 '24 08:04 JustinXu0

> Hello, I'm also using the DINOv2 model. I've noticed that the resolution of hr_feat after upsampling is still 16 times that of lr_feat, not 14 times. I suspect the error comes from the forward pass of the JBUStack class in ./featup/upsamplers.py, where self.upsample is called four times, resulting in a 2^4 = 16x upsampling of the low-resolution features. Perhaps it could be solved by building a self.upsample_new with a different scale factor to achieve a 14x magnification, and then training new pretrained weights? I really admire your work and look forward to adaptations for the other, larger versions of DINOv2.

I made the same observation. Currently, it seems only possible to upsample by a factor of x2 with JBU, applied multiple times (so x2, x4, x8, x16). It's not possible to do x14, unless you first upsample to x16 and then downsample to x14 (with a bilinear interpolation, for instance).
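
Something like this (an untested sketch; I'm assuming a 224x224 input, the ViT-S/14 backbone, and FeatUp's dinov2 entry point on torch hub):

```python
import torch
import torch.nn.functional as F

# Assumptions: FeatUp's dinov2 hub entry point and a 224x224 input image
upsampler = torch.hub.load("mhamilton723/FeatUp", "dinov2")

image = torch.randn(1, 3, 224, 224)   # 224 / 14 = 16 patches per side
lr_feat = upsampler.model(image)      # backbone features: 16x16 grid
hr_feat = upsampler(image)            # JBU stack output: 16x larger, 256x256

# Downsample the x16 output back to the true x14 grid (16 * 14 = 224)
h, w = lr_feat.shape[-2:]
hr_feat_14x = F.interpolate(hr_feat, size=(14 * h, 14 * w),
                            mode="bilinear", align_corners=False)
print(hr_feat_14x.shape[-2:])  # torch.Size([224, 224])
```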

tcourat avatar Apr 03 '24 08:04 tcourat

@mhamilton723 any plans on DINOv2 ViT-L version?

miquel-espinosa avatar Dec 18 '24 09:12 miquel-espinosa

Is there a checkpoint for DINOv2-Base?

wren93 avatar Jan 05 '25 21:01 wren93