dinov2
Is this the right way to do inference?
I presume I don't need Normalize?

Not sure if it's correct, but hope it helps.

import torch
from PIL import Image
import torchvision.transforms as T
import hubconf

dinov2_vits14 = hubconf.dinov2_vits14()

img = Image.open('meta_dog.png')

transform = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.5], std=[0.5]),
])

img = transform(img)[:3].unsqueeze(0)  # keep RGB channels only, add batch dim

with torch.no_grad():
    # return_patches requires the small forward() modification shown below
    features = dinov2_vits14(img, return_patches=True)[0]

print(features.shape)  # (256, 384): 16x16 patch tokens of dim 384 for ViT-S/14 at 224x224

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# project each patch token onto 3 PCA components and map them to RGB
pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)
pca_features = (pca_features - pca_features.min()) / (pca_features.max() - pca_features.min())
pca_features = pca_features * 255

plt.imshow(pca_features.reshape(16, 16, 3).astype(np.uint8))
plt.savefig('meta_dog_features.png')

In dinov2/models/vision_transformer.py line 290 add:

def forward(self, *args, is_training=False, return_patches=False, **kwargs):
    ret = self.forward_features(*args, **kwargs)
    if is_training:
        return ret
    elif return_patches:
        return ret["x_norm_patchtokens"]
    else:
        return self.head(ret["x_norm_clstoken"])
input: [image: meta_dog.png]

visualized features: [image: meta_dog_features.png]
@Suhail To generate features from the pretrained backbones, just use a transform similar to the standard one used for evaluating on image classification, with the typical ImageNet normalization mean and std (see what's used in the code). As noted in the model card, the model can also use image sizes that are a multiple of the patch size.
Thanks! This is what I used:
image_transforms = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
Let me know if that's wrong though.
I found the example above helpful, but instead of modifying the forward function, you can just call dino.forward_features(x)["x_norm_patchtokens"] directly.
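For reference, a minimal sketch of that approach with no changes to vision_transformer.py; the hub entrypoint and the 256/224 preprocessing are taken from elsewhere in this thread, and the image path is just a placeholder:

```python
import torch
from PIL import Image
import torchvision.transforms as T

# hub entrypoint as used later in this thread; weights download on first use
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()

transform = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = Image.open('meta_dog.png').convert('RGB')  # placeholder input image
x = transform(img).unsqueeze(0)                  # (1, 3, 224, 224)

with torch.no_grad():
    out = model.forward_features(x)

patch_tokens = out["x_norm_patchtokens"]  # (1, 256, 384) for ViT-S/14 at 224x224
cls_token = out["x_norm_clstoken"]        # (1, 384) global image embedding
```

The patch tokens are what you want for per-patch visualizations like the PCA image above; the CLS token is the global image embedding.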
What you are doing is correct. What you get with the forward method is the CLS token. If you'd like the patch tokens, you can use forward_features, as noted by @jjennings955
I think what I want is an embedding like CLIP that contains the features/understanding of the image. Is that what I'd get from forward_features?
If this is like DINO, either of the two features could be used as an image embedding.
Edit: You can see here how it is done in knn.py and log_regression.py, by simply calling model(samples).float():
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/utils.py#L122
See:
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/knn.py#L260-L264
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/log_regression.py#L277-L279
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/utils.py#L114-L122
Please note that linear.py adopts a different approach.
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/utils.py#L42-L44
See:
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/linear.py#L503-L507
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/utils.py#L39-L45
It was also the case with DINO:
- https://github.com/facebookresearch/dino/issues/72
You could also do fancier stuff, e.g. "concatenate [CLS] token and GeM pooled patch tokens", as with DINO's copy detection.
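For what it's worth, a minimal sketch of that "concatenate [CLS] token and GeM pooled patch tokens" idea; the choice of p=3, the random stand-in input, and the final L2 normalization are assumptions rather than the exact copy-detection recipe:

```python
import torch

def gem_pool(patch_tokens, p=3.0, eps=1e-6):
    # Generalized-mean pooling over the patch dimension: (B, N, D) -> (B, D)
    return patch_tokens.clamp(min=eps).pow(p).mean(dim=1).pow(1.0 / p)

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()
x = torch.randn(1, 3, 224, 224)  # stand-in for a normalized input batch

with torch.no_grad():
    out = model.forward_features(x)

cls_token = out["x_norm_clstoken"]                # (1, 384)
gem_tokens = gem_pool(out["x_norm_patchtokens"])  # (1, 384)

descriptor = torch.nn.functional.normalize(
    torch.cat([cls_token, gem_tokens], dim=1), dim=1)  # (1, 768) image descriptor
```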
How about this?

import torch
from PIL import Image
from torchvision import transforms

img = Image.open('')
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = transform(img)
input_batch = input_tensor.unsqueeze(0).cuda()
with torch.no_grad():
    output = dinov2_vits14.get_intermediate_layers(input_batch)

The output is a tuple of intermediate feature maps. You can then select which features you want from the tuple and try K-means, etc.
Yes, get_intermediate_layers() allows different approaches. This is similar to what is done in linear.py as mentioned above.
You could also use GeM pooled patch tokens with this output, as in eval_copy_detection.py for DINO (v1).
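Building on the snippet above, a sketch of how those intermediate layers can be queried; the n, reshape, and return_class_token arguments are what linear.py appears to use, so treat the exact names as an assumption:

```python
with torch.no_grad():
    # n=1 -> last block only; reshape=True returns (B, D, H/14, W/14) grids
    # instead of flat token sequences; return_class_token also returns the CLS token
    layers = dinov2_vits14.get_intermediate_layers(
        input_batch, n=1, reshape=True, return_class_token=True
    )

patch_grid, cls_tok = layers[0]  # e.g. (1, 384, 16, 16) and (1, 384) for ViT-S/14 at 224x224
```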
Sounds like this is all I need to do to get a feature embedding: dino_emb = dinov2_vitg14(t_img.unsqueeze(0))
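Yes, that returns the global (CLS-token) embedding. For CLIP-style cosine similarity between images it can help to L2-normalize it, which is what the k-NN evaluation appears to do; a minimal sketch with a stand-in input tensor:

```python
import torch
import torch.nn.functional as F

dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14').eval()
t_img = torch.randn(3, 224, 224)  # stand-in for a preprocessed image tensor

with torch.no_grad():
    dino_emb = dinov2_vitg14(t_img.unsqueeze(0))  # (1, 1536) for ViT-g/14

dino_emb = F.normalize(dino_emb, dim=1)  # unit norm: dot products become cosine similarities
```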
Closing as this seems resolved (and using #53 to keep track of documentation needs on feature extraction).
Hello, how can I train a nearest-neighbors model on embeddings extracted with the dinov2 model from images in different class folders, and then retrieve the most similar image for a query image? I tried the approach below using sklearn's NearestNeighbors.
import torch
from sklearn.neighbors import NearestNeighbors
import pickle
from PIL import Image
import torchvision.transforms as T
import os
# import hubconf
import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('device:', device)

# dinov2_vits14 = hubconf.dinov2_vits14()
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')  # note: loads ViT-g/14 despite the variable name
dinov2_vits14.to(device)

def extract_features(filename):
    img = Image.open(filename)
    transform = T.Compose([
        T.Resize(224),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.5], std=[0.5]),
    ])
    img = transform(img)[:3].unsqueeze(0)  # keep RGB channels only, add batch dim
    with torch.no_grad():
        features = dinov2_vits14(img.to(device))[0]
    return features.cpu().numpy()

extensions = ['.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG']

def get_file_list(root_dir):
    file_list = []
    for root, directories, filenames in os.walk(root_dir):
        for filename in filenames:
            if any(ext in filename for ext in extensions):
                filepath = os.path.join(root, filename)
                if os.path.exists(filepath):
                    file_list.append(filepath)
                else:
                    print(filepath)
    return file_list

# path to your dataset
root_dir = 'image_folder'
filenames = sorted(get_file_list(root_dir))
print('Total files :', len(filenames))

feature_list = []
for i in tqdm.tqdm(range(len(filenames))):
    feature_list.append(extract_features(filenames[i]))

pickle.dump(feature_list, open('dino-all-feature-list.pickle', 'wb'))
pickle.dump(filenames, open('dino-all-filenames.pickle', 'wb'))

neighbors = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='euclidean').fit(feature_list)

# Save the fitted index to a file
with open('dino-all-neighbors2.pkl', 'wb') as f:
    pickle.dump(neighbors, f)
With the above DINOv2-based model I get around 70% accuracy on the test data for retrieving images of the same class. Is there a way to improve my approach and increase the accuracy?
First, for k-NN classification, have a look at knn.py.
Second, after a quick look at your code, I would suggest trying a different metric, e.g. cosine instead of euclidean (a quick sketch follows below).
Third, I believe you should use a different image pre-processing (cf. transform in your code): copy the one used for DINOv2.
For further questions, please create a separate GitHub issue.
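Following the second suggestion, a minimal sketch of the retrieval step with a cosine metric; feature_list, filenames, and extract_features refer to the code above, and the query path is just a placeholder:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

features = np.stack(feature_list)  # (num_images, embed_dim)

# cosine distance ignores vector magnitude, which often suits raw (unnormalized) embeddings
neighbors = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(features)

query = extract_features('query.jpg').reshape(1, -1)  # placeholder query image
distances, indices = neighbors.kneighbors(query)
print([filenames[i] for i in indices[0]])
```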
Hey thanks, I will look into it. Where can I find the transform used for DINOv2?
It is mentioned above: https://github.com/facebookresearch/dinov2/issues/2#issuecomment-1512068038
https://github.com/facebookresearch/dinov2/blob/c3c2683a13cde94d4d99f523cf4170384b00c34c/dinov2/data/transforms.py#L86-L90
It is similar to what you did but some values may differ, e.g.:
- resizing to 256 resolution before center-cropping at 224 resolution,
https://github.com/facebookresearch/dinov2/blob/c3c2683a13cde94d4d99f523cf4170384b00c34c/dinov2/data/transforms.py#L80-L84
- normalizing with different mean and std.
https://github.com/facebookresearch/dinov2/blob/c3c2683a13cde94d4d99f523cf4170384b00c34c/dinov2/data/transforms.py#L43-L44
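Putting those pieces together, a sketch of that preprocessing; the constants are the standard ImageNet mean/std already quoted earlier in this thread:

```python
import torchvision.transforms as T

# standard ImageNet statistics, as used in dinov2/data/transforms.py
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

eval_transform = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),  # resize the shorter side to 256
    T.CenterCrop(224),                                         # then center-crop to 224
    T.ToTensor(),
    T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
```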
How can I visualize features like in the example above? I tried the following:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
test_img = r"image.png"
features = extract_features_new(test_img)
pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)
pca_features = (pca_features - pca_features.min()) / (pca_features.max() - pca_features.min())
pca_features = pca_features * 255
plt.imshow(pca_features.reshape(16, 16, 3).astype(np.uint8))
With this I'm getting the error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[133], line 11
      8 features = extract_features_new(test_img)
     10 pca = PCA(n_components=3)
---> 11 pca.fit(features)
     13 pca_features = pca.transform(features)
     14 pca_features = (pca_features - pca_features.min()) / (pca_features.max() - pca_features.min())

ValueError: Expected 2D array, got 1D array instead:
array=[ 0.48167408 -2.6765716  -1.8200531  ... -2.971799    1.1348227
 -1.9918671 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The feature shape is (1024,); how would I fix this?
How can I visualize features like in the example above?
- #23
- https://github.com/facebookresearch/dinov2/issues/45#issuecomment-1519813830
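In short, the error occurs because the default forward() returns a single 1024-dimensional CLS vector (a 1D array), while PCA expects a 2D (samples x features) matrix. The per-patch visualization needs the patch tokens instead, one row per patch. A minimal sketch, assuming a ViT-L/14 backbone (1024-d) and a 224x224 input so the patch grid is 16x16; the random tensor stands in for a preprocessed image:

```python
import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14').eval()
x = torch.randn(1, 3, 224, 224)  # stand-in for an image preprocessed with the eval transform above

with torch.no_grad():
    patch_tokens = model.forward_features(x)["x_norm_patchtokens"]  # (1, 256, 1024)

tokens = patch_tokens[0].cpu().numpy()                    # (256, 1024): one row per patch, as PCA expects
pca_features = PCA(n_components=3).fit_transform(tokens)  # (256, 3)
pca_features = (pca_features - pca_features.min()) / (pca_features.max() - pca_features.min())

plt.imshow(pca_features.reshape(16, 16, 3))  # imshow accepts floats in [0, 1]
plt.savefig('patch_pca.png')
```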
Hi, it seems that I can get a feature embedding of [1, 256, 384] for an image; after reshaping it to [1, 16, 16, 384] I can get the visualized features. But how can I get a feature map with a larger resolution? I want to capture finer information such as texture.
Hi @XiaominLi1997, use larger models; these increase the feature dimension:
- feat_dim = 384 # vits14
- feat_dim = 768 # vitb14
- feat_dim = 1024 # vitl14
- feat_dim = 1536 # vitg14
So you can use ViT-g/14. For a finer spatial grid, also increase the input image size in multiples of 14, e.g. 518 px (i.e. 14-pixel patches * 37 patches). Hope this helps.
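A sketch of the higher-resolution case; resizing to exactly 518x518 is just one convenient choice (any multiple of the 14-pixel patch size should work, per the model card), and the image path is a placeholder:

```python
import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14').eval()

transform = T.Compose([
    T.Resize((518, 518), interpolation=T.InterpolationMode.BICUBIC),  # 518 = 14 * 37
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = Image.open('example.jpg').convert('RGB')  # placeholder image path
x = transform(img).unsqueeze(0)                 # (1, 3, 518, 518)

with torch.no_grad():
    tokens = model.forward_features(x)["x_norm_patchtokens"]  # (1, 37*37, 1536)

grid = tokens.reshape(1, 37, 37, 1536)  # finer 37x37 spatial grid of features
```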
T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
Why do you need to normalize with mean (0.485, 0.456, 0.406)? Is this mentioned anywhere?
@ydove0324 These are the standard ImageNet mean values used during training; it's common practice.