dinov2

Understanding the difference between CLS features and patch features

Open barbolo opened this issue 1 year ago • 6 comments

Hi, first of all, thanks for the great work on DINOv2.

Imagine that I want to find dogs (and their positions) among several images in my dataset.

  1. I have DINOv2 CLS features obtained from an image of a dog
  2. I have several DINOv2 patch features for each image in my dataset.

I can confirm that I'm able to find images with dogs in my dataset by calculating a similarity score (e.g. dot product) between the CLS feature of the dog image and the patch features of each image in the dataset. It did work.

What I'm trying to find out is whether this result is just a coincidence or an intended property of DINOv2. I've skimmed through the paper and couldn't find the answer.
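For readers who want to reproduce the comparison described above, here is a minimal sketch. It assumes the features have already been extracted and are held as NumPy arrays; the random arrays, the shapes (384 dimensions, a 16 x 16 patch grid) and the helper names `score_images` and `patch_index_to_grid` are illustrative assumptions, not part of the DINOv2 code. It uses cosine similarity rather than a raw dot product so that scores are comparable across images.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def score_images(query_cls, patch_features):
    """Compare one CLS vector against the patch features of N images.

    query_cls:      (D,) CLS feature of the query image.
    patch_features: (N, P, D) patch features, P patches per image.
    Returns (scores, best_patch): the highest cosine similarity per image
    and the index of the patch that produced it.
    """
    q = l2_normalize(query_cls)
    p = l2_normalize(patch_features)
    sims = p @ q  # (N, P) cosine similarities
    return sims.max(axis=1), sims.argmax(axis=1)


def patch_index_to_grid(idx, grid_width):
    """Convert a flat patch index to (row, col) on the patch grid."""
    return divmod(int(idx), grid_width)


# Synthetic stand-ins for real features (assumption: D=384, 16x16 patches).
rng = np.random.default_rng(0)
query_cls = rng.normal(size=384)
patch_features = rng.normal(size=(4, 256, 384))
# Plant a patch resembling the query in image 2, patch 37.
patch_features[2, 37] = query_cls + 0.05 * rng.normal(size=384)

scores, best_patch = score_images(query_cls, patch_features)
top_image = int(scores.argmax())
top_location = patch_index_to_grid(best_patch[top_image], grid_width=16)
```

Taking the maximum over patches gives one score per image, and the index of the winning patch gives a rough position. With real features, whether the CLS vector and the patch vectors are comparable at all is exactly the open question in this thread.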

Thank you.

barbolo avatar Sep 20 '23 20:09 barbolo

Coincidence, I would say.

qasfb avatar Sep 21 '23 11:09 qasfb

I've seen this kind of result with different classes of images (animals, objects, plants). Maybe this result has emerged unintentionally?

barbolo avatar Sep 21 '23 12:09 barbolo

That's very much a possibility; at no point do we expect the CLS and patch tokens to align, though!

qasfb avatar Sep 21 '23 13:09 qasfb

Hi @barbolo

I'm trying to understand the difference between the CLS token and patch features. Can you please point me to some materials? I know that CLS tokens are used as an embedding for classification, for example, but I don't know what patch features can be used for.

Thanks

eric-vision-e avatar Mar 08 '24 15:03 eric-vision-e

@eric-vision-e you might take a look at some of the demos (dense matching, sparse matching) at the link below:

https://dinov2.metademolab.com/
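As a rough illustration of what patch-level matching can look like, here is a minimal sketch of mutual nearest-neighbour matching between the patch features of two images. This is a generic technique, not necessarily what the demos implement; the function name `mutual_nearest_neighbors` and the synthetic arrays are assumptions made for the example.

```python
import numpy as np


def mutual_nearest_neighbors(patches_a, patches_b):
    """Return (i, j) pairs where patch i of image A and patch j of image B
    are each other's nearest neighbour under cosine similarity.

    patches_a: (Pa, D) patch features of image A.
    patches_b: (Pb, D) patch features of image B.
    """
    a = patches_a / np.linalg.norm(patches_a, axis=1, keepdims=True)
    b = patches_b / np.linalg.norm(patches_b, axis=1, keepdims=True)
    sims = a @ b.T                 # (Pa, Pb) cosine similarities
    nn_ab = sims.argmax(axis=1)    # best match in B for each patch of A
    nn_ba = sims.argmax(axis=0)    # best match in A for each patch of B
    idx_a = np.arange(len(a))
    keep = nn_ba[nn_ab] == idx_a   # keep only mutual agreements
    return [(int(i), int(nn_ab[i])) for i in idx_a[keep]]


# Synthetic stand-ins: image B holds the patches of image A, shuffled,
# with a little noise added (assumption: 64 patches, D=384).
rng = np.random.default_rng(0)
patches_a = rng.normal(size=(64, 384))
perm = rng.permutation(64)
patches_b = patches_a[perm] + 0.01 * rng.normal(size=(64, 384))

matches = mutual_nearest_neighbors(patches_a, patches_b)
```

Each matched pair links a location in one image to a location in the other, which is the kind of use patch features lend themselves to and a single CLS vector does not.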

barbolo avatar Mar 08 '24 15:03 barbolo

Hi @barbolo,

OK, thanks. I understand now.

eric-vision-e avatar Mar 10 '24 18:03 eric-vision-e