salad
Confusion about feature shape in salad forward?
DINOv2's output should have N features, so I think the feature shape should be [B, N, C, H // 14, W // 14]?
Hi @chennuo0125-HIT
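The N features from DINOv2 are the patches themselves: N = (H // 14) * (W // 14). For every 14x14 patch of the input image, DINOv2 returns one C-dimensional vector, so for each image it returns a [C, N] tensor, which can be reshaped to [C, H // 14, W // 14]. With the batch dimension the feature shape is therefore [B, C, H // 14, W // 14]; there is no separate N axis.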
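
To make the shape bookkeeping concrete, here is a minimal sketch in plain Python. The helper `dinov2_feature_shapes` is hypothetical and not part of the salad code. The 224x224 input and C = 768 are example values I picked, and the [B, N, C] token layout is an assumption (per image, the reply above writes it as [C, N]).

```python
PATCH_SIZE = 14  # patch size used in the H // 14, W // 14 expressions above


def dinov2_feature_shapes(batch, channels, height, width, patch=PATCH_SIZE):
    """Return (token_shape, feature_map_shape) for a batch of images.

    Hypothetical helper for illustration only; it does shape arithmetic
    and does not run any model.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    grid_h, grid_w = height // patch, width // patch
    num_patches = grid_h * grid_w  # this is N: one feature per patch
    # Assumed token layout: one C-vector per patch.
    token_shape = (batch, num_patches, channels)
    # Same values with N unfolded back into the 2D patch grid.
    feature_map_shape = (batch, channels, grid_h, grid_w)
    return token_shape, feature_map_shape


# Example values (assumptions): batch of 2, C = 768, 224x224 images.
tokens, fmap = dinov2_feature_shapes(batch=2, channels=768, height=224, width=224)
print(tokens)  # (2, 256, 768)
print(fmap)    # (2, 768, 16, 16)
```

N (256 here) and the 16 x 16 grid hold the same number of elements, so the 4D shape is just a reshaped view of the per-patch features rather than an extra dimension.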