dinov2
Understanding the difference between CLS features and patch features
Hi, first of all, thanks for the great work with DINOv2.
Imagine that I want to find dogs (and their positions) among several images in my dataset.
- I have DINOv2 CLS features obtained from an image of a dog
- I have several DINOv2 patch features for each image in my dataset.
I can confirm that I'm able to find images with dogs in my dataset by calculating a similarity score (e.g. dot product) between the CLS feature of the dog image and the patch features of each image in the dataset. It did work.
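For concreteness, the scoring described above can be sketched roughly as follows. This is only an illustration under assumptions: the function names (`cosine`, `score_image`) and the toy 3-dimensional vectors are made up for this example, cosine similarity stands in for the plain dot product, and in practice the vectors would be the CLS and patch features produced by the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def score_image(cls_query, patch_feats, grid_w):
    """Score one image against a query CLS feature.

    patch_feats is a row-major list of patch vectors (hypothetical layout),
    grid_w is the number of patches per row.
    Returns the best similarity and the (row, col) of the best patch.
    """
    sims = [cosine(cls_query, p) for p in patch_feats]
    best = max(range(len(sims)), key=sims.__getitem__)
    row, col = divmod(best, grid_w)
    return sims[best], (row, col)

# Toy stand-ins: a query CLS vector and a 2x2 grid of patch vectors.
cls_query = [1.0, 0.0, 0.0]
patches = [
    [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
    [2.0, 0.1, 0.0], [0.0, 1.0, 1.0],
]
score, (row, col) = score_image(cls_query, patches, grid_w=2)
```

Ranking the images by `score` gives the retrieval, and `(row, col)` gives a coarse position of the best-matching patch inside each image.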
What I'm trying to find out is whether this result is just a coincidence or an intended property of DINOv2. I've skimmed through the paper and couldn't find the answer.
Thank you.
Coincidence I would say.
I've seen this kind of result with different classes of images (animals, objects, plants). Maybe this result has emerged unintentionally?
That's very much a possibility; at no point do we expect the CLS and patch tokens to align, though!
Hi @barbolo
I'm trying to understand the difference between the CLS token and the patch features. Can you please point me to some materials? I know that CLS tokens are used as an image embedding, for classification for example, but I don't know what patch features can be used for.
Thanks
@eric-vision-e you might take a look at some demos (dense matching, sparse matching) in the link below:
https://dinov2.metademolab.com/
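As a rough illustration of the idea behind such matching: the CLS feature is one vector per image, while the patch features are one vector per image patch, so each of them can be tied back to a location. The sketch below is an assumption-laden toy, not the demo's code: `patch_box` and `match_patches` are hypothetical helpers, the features are made-up 2-dimensional vectors, and a patch size of 14 pixels with a 16x16 grid (a 224x224 input) is assumed.

```python
def patch_box(index, grid_w, patch_size=14):
    """Pixel box (x0, y0, x1, y1) covered by a patch in a row-major grid."""
    row, col = divmod(index, grid_w)
    x0, y0 = col * patch_size, row * patch_size
    return (x0, y0, x0 + patch_size, y0 + patch_size)

def match_patches(feats_a, feats_b):
    """For each patch of image A, index of the most similar patch of image B.

    Similarity here is a plain dot product; normalizing the features first
    would turn it into cosine similarity.
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    return [
        max(range(len(feats_b)), key=lambda j: dot(fa, feats_b[j]))
        for fa in feats_a
    ]

# Toy stand-ins for the patch features of two images.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.0, 1.0], [1.0, 0.0]]
matches = match_patches(feats_a, feats_b)

# Patch 17 in a 16-wide grid sits at row 1, column 1.
box = patch_box(17, grid_w=16)
```

Here each entry of `matches` pairs a patch of the first image with a patch of the second, and `patch_box` converts a patch index back into pixel coordinates.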
Hi @barbolo,
ok thanks. I understand now.