Confusion about feature shape in SALAD forward?

Open chennuo0125-HIT opened this issue 8 months ago • 1 comments

DINOv2's output should have N features, so I think the feature shape should be [B, N, C, H // 14, W // 14]?

Apr 18 '25 08:04 chennuo0125-HIT

Hi @chennuo0125-HIT

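The N features from DINOv2 are the patches themselves: N = (H // 14) * (W // 14). For every patch of the input image, DINOv2 returns a C-dimensional vector, so for every image it returns a [C, N] tensor, which can be reshaped to [C, H // 14, W // 14]. With the batch dimension that gives [B, C, H // 14, W // 14], so there is no separate N axis.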
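To make the shape arithmetic concrete, here is a minimal sketch in plain Python. The helper names `patch_grid` and `feature_map_shape` are hypothetical (they are not part of SALAD or DINOv2), and the 224x224 input with C = 768 is only an assumed example.

```python
def patch_grid(height, width, patch_size=14):
    """Return (rows, cols, n_patches) for a grid of non-overlapping patches."""
    rows = height // patch_size
    cols = width // patch_size
    # N is the total number of patches, i.e. rows * cols.
    return rows, cols, rows * cols


def feature_map_shape(batch, channels, height, width, patch_size=14):
    """Shape of the patch features once N is unfolded back into a 2D grid."""
    rows, cols, _ = patch_grid(height, width, patch_size)
    return (batch, channels, rows, cols)


# Assumed example: batch of 2 images, 224x224 pixels, C = 768 channels.
rows, cols, n = patch_grid(224, 224)
print(rows, cols, n)                         # 16 16 256
print(feature_map_shape(2, 768, 224, 224))   # (2, 768, 16, 16)
```

The N axis and the (H // 14, W // 14) grid hold the same 256 entries here, which is why the two never appear together in one shape.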

Jul 18 '25 08:07 serizba