dinov2
Is this the right way to do inference?
I presume I don't need Normalize?

Not sure if it's correct, but hope it helps.

import torch
from PIL import Image
import torchvision.transforms as T
import hubconf

dinov2_vits14 = hubconf.dinov2_vits14()

img = Image.open('meta_dog.png')

transform = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.5], std=[0.5]),
])

img = transform(img)[:3].unsqueeze(0)  # keep RGB channels only, add batch dim

with torch.no_grad():
    # return_patches requires the small forward() modification shown below
    features = dinov2_vits14(img, return_patches=True)[0]

print(features.shape)  # (256, 384): 16x16 patch tokens of dim 384 for ViT-S/14 at 224x224

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# project each patch token onto 3 PCA components and map them to RGB
pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)
pca_features = (pca_features - pca_features.min()) / (pca_features.max() - pca_features.min())
pca_features = pca_features * 255

plt.imshow(pca_features.reshape(16, 16, 3).astype(np.uint8))
plt.savefig('meta_dog_features.png')

In dinov2/models/vision_transformer.py line 290 add:

def forward(self, *args, is_training=False, return_patches=False, **kwargs):
    ret = self.forward_features(*args, **kwargs)
    if is_training:
        return ret
    elif return_patches:
        return ret["x_norm_patchtokens"]
    else:
        return self.head(ret["x_norm_clstoken"])
input: [image: meta_dog.png]

visualized features: [image: meta_dog_features.png]
@Suhail To generate features from the pretrained backbones, just use a transform similar to the standard one used for evaluating on image classification, with the typical ImageNet normalization mean and std (see what's used in the code). As noted in the model card, the model can also use image sizes that are a multiple of the patch size.
Thanks! This is what I used:
image_transforms = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
Let me know if that's wrong though.
I found the example above helpful, but instead of modifying the forward function, you can just call dino.forward_features(x)["x_norm_patchtokens"] directly.
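For reference, a minimal sketch of that approach with no changes to vision_transformer.py; the hub entrypoint and the 256/224 preprocessing are taken from elsewhere in this thread, and the image path is just a placeholder:

```python
import torch
from PIL import Image
import torchvision.transforms as T

# hub entrypoint as used later in this thread; weights download on first use
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()

transform = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = Image.open('meta_dog.png').convert('RGB')  # placeholder input image
x = transform(img).unsqueeze(0)                  # (1, 3, 224, 224)

with torch.no_grad():
    out = model.forward_features(x)

patch_tokens = out["x_norm_patchtokens"]  # (1, 256, 384) for ViT-S/14 at 224x224
cls_token = out["x_norm_clstoken"]        # (1, 384) global image embedding
```

The patch tokens are what you want for per-patch visualizations like the PCA image above; the CLS token is the global image embedding.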
What you are doing is correct. What you get with the forward method is the CLS token. If you'd like the patch tokens, you can use forward_features, as noted by @jjennings955
I think what I want is an embedding like CLIP that contains the features/understanding of the image. Is that what I'd get from forward_features?
If this is like DINO, either of the two features could be used as an image embedding.
Edit: You can see here how it is done in knn.py and log_regression.py, by simply calling model(samples).float():
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/utils.py#L122
See:
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/knn.py#L260-L264
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/log_regression.py#L277-L279
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/utils.py#L114-L122
Please note that linear.py adopts a different approach.
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/utils.py#L42-L44
See:
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/linear.py#L503-L507
https://github.com/facebookresearch/dinov2/blob/fc49f49d734c767272a4ea0e18ff2ab8e60fc92d/dinov2/eval/utils.py#L39-L45
It was also the case with DINO:
- https://github.com/facebookresearch/dino/issues/72
You could also do fancier stuff, e.g. "concatenate [CLS] token and GeM pooled patch tokens", as with DINO's copy detection.
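For what it's worth, a minimal sketch of that "concatenate [CLS] token and GeM pooled patch tokens" idea; the choice of p=3, the random stand-in input, and the final L2 normalization are assumptions rather than the exact copy-detection recipe:

```python
import torch

def gem_pool(patch_tokens, p=3.0, eps=1e-6):
    # Generalized-mean pooling over the patch dimension: (B, N, D) -> (B, D)
    return patch_tokens.clamp(min=eps).pow(p).mean(dim=1).pow(1.0 / p)

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()
x = torch.randn(1, 3, 224, 224)  # stand-in for a normalized input batch

with torch.no_grad():
    out = model.forward_features(x)

cls_token = out["x_norm_clstoken"]                # (1, 384)
gem_tokens = gem_pool(out["x_norm_patchtokens"])  # (1, 384)

descriptor = torch.nn.functional.normalize(
    torch.cat([cls_token, gem_tokens], dim=1), dim=1)  # (1, 768) image descriptor
```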
How about this?

import torch
from PIL import Image
from torchvision import transforms

img = Image.open('')
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = transform(img)
input_batch = input_tensor.unsqueeze(0).cuda()
with torch.no_grad():
    output = dinov2_vits14.get_intermediate_layers(input_batch)

The output is a tuple of intermediate feature maps. You can then select which features you want from the tuple and try K-means, etc.
Yes, get_intermediate_layers() allows different approaches. This is similar to what is done in linear.py as mentioned above.
You could also use GeM pooled patch tokens with this output, as in eval_copy_detection.py for DINO (v1).
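Building on the snippet above, a sketch of how those intermediate layers can be queried; the n, reshape, and return_class_token arguments are what linear.py appears to use, so treat the exact names as an assumption:

```python
with torch.no_grad():
    # n=1 -> last block only; reshape=True returns (B, D, H/14, W/14) grids
    # instead of flat token sequences; return_class_token also returns the CLS token
    layers = dinov2_vits14.get_intermediate_layers(
        input_batch, n=1, reshape=True, return_class_token=True
    )

patch_grid, cls_tok = layers[0]  # e.g. (1, 384, 16, 16) and (1, 384) for ViT-S/14 at 224x224
```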
Sounds like this is all I need to do to get a feature embedding: dino_emb = dinov2_vitg14(t_img.unsqueeze(0))
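Yes, that returns the global (CLS-token) embedding. For CLIP-style cosine similarity between images it can help to L2-normalize it, which is what the k-NN evaluation appears to do; a minimal sketch with a stand-in input tensor:

```python
import torch
import torch.nn.functional as F

dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14').eval()
t_img = torch.randn(3, 224, 224)  # stand-in for a preprocessed image tensor

with torch.no_grad():
    dino_emb = dinov2_vitg14(t_img.unsqueeze(0))  # (1, 1536) for ViT-g/14

dino_emb = F.normalize(dino_emb, dim=1)  # unit norm: dot products become cosine similarities
```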
Closing as this seems resolved (and using #53 to keep track of documentation needs on feature extraction).
Hello, how can I train a nearest-neighbors model on embeddings extracted with the dinov2 model from images in different class folders, and then retrieve the most similar image for a query image? I tried the approach below using sklearn's NearestNeighbors.
import torch
from sklearn.neighbors import NearestNeighbors
import pickle
from PIL import Image
import torchvision.transforms as T
import os
# import hubconf
import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('device:', device)

# dinov2_vits14 = hubconf.dinov2_vits14()
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')  # note: loads ViT-g/14 despite the variable name
dinov2_vits14.to(device)

def extract_features(filename):
    img = Image.open(filename)
    transform = T.Compose([
        T.Resize(224),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.5], std=[0.5]),
    ])
    img = transform(img)[:3].unsqueeze(0)  # keep RGB channels only, add batch dim
    with torch.no_grad():
        features = dinov2_vits14(img.to(device))[0]
    return features.cpu().numpy()

extensions = ['.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG']

def get_file_list(root_dir):
    file_list = []
    for root, directories, filenames in os.walk(root_dir):
        for filename in filenames:
            if any(ext in filename for ext in extensions):
                filepath = os.path.join(root, filename)
                if os.path.exists(filepath):
                    file_list.append(filepath)
                else:
                    print(filepath)
    return file_list

# path to your dataset
root_dir = 'image_folder'
filenames = sorted(get_file_list(root_dir))
print('Total files :', len(filenames))

feature_list = []
for i in tqdm.tqdm(range(len(filenames))):
    feature_list.append(extract_features(filenames[i]))

pickle.dump(feature_list, open('dino-all-feature-list.pickle', 'wb'))
pickle.dump(filenames, open('dino-all-filenames.pickle', 'wb'))

neighbors = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='euclidean').fit(feature_list)

# Save the fitted index to a file
with open('dino-all-neighbors2.pkl', 'wb') as f:
    pickle.dump(neighbors, f)
With the above DINOv2-based model I get around 70% accuracy on the test data for retrieving images of the same class. Is there a way to improve my approach and increase the accuracy?
First, for k-NN classification, have a look at knn.py.
Second, after a quick look at your code, I would suggest trying a different metric, e.g. cosine instead of euclidean (a quick sketch follows below).
Third, I believe you should use a different image pre-processing (cf. transform in your code): copy the one used for DINOv2.
For further questions, please create a separate GitHub issue.
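Following the second suggestion, a minimal sketch of the retrieval step with a cosine metric; feature_list, filenames, and extract_features refer to the code above, and the query path is just a placeholder:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

features = np.stack(feature_list)  # (num_images, embed_dim)

# cosine distance ignores vector magnitude, which often suits raw (unnormalized) embeddings
neighbors = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(features)

query = extract_features('query.jpg').reshape(1, -1)  # placeholder query image
distances, indices = neighbors.kneighbors(query)
print([filenames[i] for i in indices[0]])
```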
Hey thanks, I will look into it. Where can I find the transform used for DINOv2?
It is mentioned above: https://github.com/facebookresearch/dinov2/issues/2#issuecomment-1512068038
https://github.com/facebookresearch/dinov2/blob/c3c2683a13cde94d4d99f523cf4170384b00c34c/dinov2/data/transforms.py#L86-L90
It is similar to what you did but some values may differ, e.g.:
- resizing to 256 resolution before center-cropping at 224 resolution,
https://github.com/facebookresearch/dinov2/blob/c3c2683a13cde94d4d99f523cf4170384b00c34c/dinov2/data/transforms.py#L80-L84
- normalizing with different mean and std.
https://github.com/facebookresearch/dinov2/blob/c3c2683a13cde94d4d99f523cf4170384b00c34c/dinov2/data/transforms.py#L43-L44
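Putting those pieces together, a sketch of that preprocessing; the constants are the standard ImageNet mean/std already quoted earlier in this thread:

```python
import torchvision.transforms as T

# standard ImageNet statistics, as used in dinov2/data/transforms.py
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

eval_transform = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),  # resize the shorter side to 256
    T.CenterCrop(224),                                         # then center-crop to 224
    T.ToTensor(),
    T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
```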
How can I visualize features like in the example above? I tried the following:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
test_img = r"image.png"
features = extract_features_new(test_img)
pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)
pca_features = (pca_features - pca_features.min()) / (pca_features.max() - pca_features.min())
pca_features = pca_features * 255
plt.imshow(pca_features.reshape(16, 16, 3).astype(np.uint8))
With this I'm getting the error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[133], line 11
      8 features = extract_features_new(test_img)
     10 pca = PCA(n_components=3)
---> 11 pca.fit(features)
     13 pca_features = pca.transform(features)
     14 pca_features = (pca_features - pca_features.min()) / (pca_features.max() - pca_features.min())

ValueError: Expected 2D array, got 1D array instead:
array=[ 0.48167408 -2.6765716  -1.8200531  ... -2.971799    1.1348227
 -1.9918671 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The feature shape is (1024,); how would I fix this?
How can I visualize features like in the example above?
- #23
- https://github.com/facebookresearch/dinov2/issues/45#issuecomment-1519813830
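In short, the error occurs because the default forward() returns a single 1024-dimensional CLS vector (a 1D array), while PCA expects a 2D (samples x features) matrix. The per-patch visualization needs the patch tokens instead, one row per patch. A minimal sketch, assuming a ViT-L/14 backbone (1024-d) and a 224x224 input so the patch grid is 16x16; the random tensor stands in for a preprocessed image:

```python
import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14').eval()
x = torch.randn(1, 3, 224, 224)  # stand-in for an image preprocessed with the eval transform above

with torch.no_grad():
    patch_tokens = model.forward_features(x)["x_norm_patchtokens"]  # (1, 256, 1024)

tokens = patch_tokens[0].cpu().numpy()                    # (256, 1024): one row per patch, as PCA expects
pca_features = PCA(n_components=3).fit_transform(tokens)  # (256, 3)
pca_features = (pca_features - pca_features.min()) / (pca_features.max() - pca_features.min())

plt.imshow(pca_features.reshape(16, 16, 3))  # imshow accepts floats in [0, 1]
plt.savefig('patch_pca.png')
```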
Hi, it seems that I can get a feature embedding of [1, 256, 384] for an image; after reshaping it to [1, 16, 16, 384] I can get the visualized features. But how can I get a feature map with a larger resolution? I want to capture finer information such as texture.
Hi @XiaominLi1997, use larger models; these increase the feature dimension:
- feat_dim = 384 # vits14
- feat_dim = 768 # vitb14
- feat_dim = 1024 # vitl14
- feat_dim = 1536 # vitg14
So you can use ViT-g/14. For a finer spatial grid, also increase the input image size in multiples of 14, e.g. 518 px (i.e. 14-pixel patches * 37 patches). Hope this helps.
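A sketch of the higher-resolution case; resizing to exactly 518x518 is just one convenient choice (any multiple of the 14-pixel patch size should work, per the model card), and the image path is a placeholder:

```python
import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14').eval()

transform = T.Compose([
    T.Resize((518, 518), interpolation=T.InterpolationMode.BICUBIC),  # 518 = 14 * 37
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = Image.open('example.jpg').convert('RGB')  # placeholder image path
x = transform(img).unsqueeze(0)                 # (1, 3, 518, 518)

with torch.no_grad():
    tokens = model.forward_features(x)["x_norm_patchtokens"]  # (1, 37*37, 1536)

grid = tokens.reshape(1, 37, 37, 1536)  # finer 37x37 spatial grid of features
```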
T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
Why do you need to normalize with mean (0.485, 0.456, 0.406)? Is this mentioned anywhere?
@ydove0324 These are the standard ImageNet mean values used during training; it's common practice.