
DINOv2 is now available in HF Transformers (with tutorial)

Open NielsRogge opened this issue 1 year ago • 23 comments

Hi folks,

There are multiple issues here regarding fine-tuning DINOv2 on custom data, as well as questions about semantic segmentation, depth estimation, image similarity, feature extraction, etc. All of this should now become easier given that the model is available in HF Transformers. Check below for tips and tricks.

Documentation: https://huggingface.co/docs/transformers/main/model_doc/dinov2

The checkpoints are on the hub: https://huggingface.co/models?other=dinov2

I've created a tutorial notebook on training a linear classifier using DINOv2's frozen features for semantic segmentation. The notebook would be very similar for image classification or depth estimation.

Image classification

For image classification, the easiest option is to use the Dinov2ForImageClassification class available in the Transformers library. You can then just follow the example notebooks or example scripts.
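
For illustration, here is a minimal sketch of a forward pass through Dinov2ForImageClassification; the random tensor and the three labels are placeholders standing in for a real, preprocessed dataset:

import torch
from transformers import Dinov2ForImageClassification

# num_labels=3 is an arbitrary placeholder; a fresh classification head is initialized
model = Dinov2ForImageClassification.from_pretrained("facebook/dinov2-base", num_labels=3)

pixel_values = torch.randn(2, 3, 224, 224)  # stand-in for images preprocessed by AutoImageProcessor
labels = torch.tensor([0, 2])

outputs = model(pixel_values=pixel_values, labels=labels)
print(outputs.loss, outputs.logits.shape)  # logits have shape (batch_size, num_labels)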

Semantic segmentation

Refer to my demo notebook here: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DINOv2. One just places a linear classifier on top of the model, and uses the features as-is.
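
To make that concrete, below is a minimal sketch in the spirit of the notebook: a frozen DINOv2 backbone with a linear head applied to the patch features and upsampled to the input resolution. The LinearSegmentationHead class, num_labels=3 and the random input are illustrative assumptions, not the exact notebook code:

import torch
import torch.nn as nn
from transformers import AutoModel

class LinearSegmentationHead(nn.Module):
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.classifier = nn.Conv2d(hidden_size, num_labels, kernel_size=1)

    def forward(self, patch_embeddings, height, width):
        # (batch, num_patches, channels) -> (batch, channels, height, width)
        batch_size, _, channels = patch_embeddings.shape
        x = patch_embeddings.transpose(1, 2).reshape(batch_size, channels, height, width)
        return self.classifier(x)

backbone = AutoModel.from_pretrained("facebook/dinov2-base")
for param in backbone.parameters():
    param.requires_grad = False  # keep the backbone frozen, only the head is trained

head = LinearSegmentationHead(backbone.config.hidden_size, num_labels=3)

pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for preprocessed images
patch_embeddings = backbone(pixel_values=pixel_values).last_hidden_state[:, 1:, :]  # drop the CLS token

grid = 224 // backbone.config.patch_size  # number of patches per side
logits = head(patch_embeddings, grid, grid)
logits = nn.functional.interpolate(logits, size=(224, 224), mode="bilinear", align_corners=False)
print(logits.shape)  # (1, num_labels, 224, 224)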

Depth estimation

  • DPT + DINOv2 is now supported; a notebook is available here: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DPT/Inference_with_DPT_%2B_DINOv2_for_depth_estimation.ipynb.
  • For fine-tuning, one would however use a different loss function, like the one used in the GLPN model (https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/glpn/modeling_glpn.py#L765-L766) to compute the loss between the predicted logits and the ground-truth depth maps; a rough sketch of such a loss is included after this list. I would recommend this guide which goes over fine-tuning.
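
As a rough illustration of the kind of loss used for depth estimation, here is a sketch of a scale-invariant log loss in the spirit of the one in GLPN. This is not the exact Transformers implementation, and the lambd value is just a common default:

import torch
import torch.nn as nn

class ScaleInvariantLogLoss(nn.Module):
    def __init__(self, lambd=0.5):
        super().__init__()
        self.lambd = lambd

    def forward(self, pred, target):
        # only supervise pixels with a valid (positive) ground-truth depth
        valid = target > 0
        diff = torch.log(pred[valid]) - torch.log(target[valid])
        return torch.sqrt((diff ** 2).mean() - self.lambd * diff.mean() ** 2)

# usage: predicted and ground-truth depth maps of shape (batch, height, width)
criterion = ScaleInvariantLogLoss()
loss = criterion(torch.rand(2, 224, 224) + 0.1, torch.rand(2, 224, 224))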

Feature extraction

Feature extraction is also very simple:

from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
features = outputs.last_hidden_state

The features in this case will be a PyTorch tensor of shape (batch_size, num_image_patches, embedding_dim). So one can turn them into a single vector by averaging over the image patches, like so:

features = features.mean(dim=1)

Now you have a single 768-dimensional vector (or another size, depending on which checkpoint you are using) for each image in your batch.

Getting intermediate features

Intermediate features can be obtained in 2 ways:

  • by passing output_hidden_states=True to the forward method of the code snippet above. The outputs will then contain an additional key called hidden_states, which contains the intermediate features for each of the Transformer layers (see the sketch after this list).
  • by leveraging the Dinov2Backbone class.
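
Here is a minimal sketch of the first option; the random tensor is a stand-in for images preprocessed by AutoImageProcessor:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-base")

pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for preprocessed images
outputs = model(pixel_values=pixel_values, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one entry per Transformer layer
print(len(outputs.hidden_states))
print(outputs.hidden_states[-1].shape)  # (batch_size, num_patches + 1, hidden_size)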

Image similarity

We have a tutorial on that here: https://huggingface.co/blog/image-similarity. Given that DINOv2 is now available in HF Transformers, one can simply replace the model_ckpt in the blog post with one of the DINOv2 checkpoints on the 🤗 hub.
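
As a small illustration (not the exact code from the blog post), one could mean-pool the DINOv2 features of two images and compare them with cosine similarity; the second image below is just a rotated copy used as a stand-in:

import torch
import torch.nn.functional as F
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

def embed(image):
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)  # average over the patch dimension

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image1 = Image.open(requests.get(url, stream=True).raw)
image2 = image1.rotate(10)  # stand-in for a second image

print(F.cosine_similarity(embed(image1), embed(image2)).item())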

Can be relevant for #6, #14, #15, #25, #46, #47, #54, #55, #80, #84, #97, #99

Have fun fine-tuning them!

Cheers,

Niels

NielsRogge avatar Aug 03 '23 09:08 NielsRogge

I have a silly question, sorry to ask it here. For the hidden_states, I want to convert (batch_size, num_image_patches, embedding_dim) to (batch_size, h, w, embedding_dim) for segmentation tasks. But I found that for a (224, 224) image, num_image_patches is 257 (not 16x16=256). What is the correct way to reshape it?

For torch.hub, there is a function provided for this:

encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
hidden_states = encoder.get_intermediate_layers(pixel_values, out_indices, reshape=True)
# -> (b, 1024, h/14, w/14)

And I browsed the function in the repo and found that it is not that easy to adapt directly to outputs.hidden_states of the Hugging Face model.

Starlento avatar Aug 15 '23 06:08 Starlento

Hi @Starlento great question! This is because DINOv2 (and vision transformers in general) typically also add a special CLS token before the sequence of image patches. Hence the sequence length becomes (image_size/patch_size)**2 + 1. So in case you use a DINOv2 model with an image resolution of 224 and a patch size of 16, you get (224/16)**2 + 1 = 257 embeddings out.

Hence one usually discards the final embedding of the CLS token and only uses the embeddings of the image patches, as done here. I think it makes sense to add a Dinov2Backbone class to the Transformers library, in a similar spirit to other backbones. I've made a PR above for that.

Here's how you can use it (for now you'll need to do pip install git+https://github.com/nielsrogge/transformers.git@add_dinov2_backbone):

from transformers import Dinov2Backbone
import torch

model = Dinov2Backbone.from_pretrained("facebook/dinov2-base", out_indices=[0,1,2,3])

pixel_values = torch.randn(1, 3, 224, 224)

outputs = model(pixel_values)

for feature_map in outputs.feature_maps:
    print(feature_map.shape)

By default, feature maps will be 4D, i.e. of shape (batch_size, num_channels, height, width). If you want 3D feature maps, just pass reshape_hidden_states=False to the from_pretrained method.
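
Alternatively, if you prefer to stay with the plain Dinov2Model output and do the reshape yourself, a minimal sketch (the random tensor stands in for preprocessed images) could look like this:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-base")

pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for preprocessed images
outputs = model(pixel_values=pixel_values)

patch_size = model.config.patch_size
height = pixel_values.shape[2] // patch_size
width = pixel_values.shape[3] // patch_size

# drop the CLS token (index 0), then reshape the patch tokens into a spatial grid
patch_embeddings = outputs.last_hidden_state[:, 1:, :]
feature_map = patch_embeddings.reshape(-1, height, width, model.config.hidden_size).permute(0, 3, 1, 2)
print(feature_map.shape)  # (batch_size, hidden_size, height // patch_size, width // patch_size)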

NielsRogge avatar Aug 15 '23 10:08 NielsRogge

I've created a tutorial notebook on training a linear classifier using DINOv2's frozen features for semantic segmentation. The notebook would be very similar for image classification or depth estimation.

Hi,

Thank you very much for the tutorial and the HF version. I ran it on Colab. I have two questions:

  1. Is add_pooling_layer=False necessary? If it is set to False, will a pooling layer be added? I don't think I came across this parameter in the config/model class from HF.

Error below: 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-37-433acb205ffe>](https://localhost:8080/#) in <cell line: 1>()
----> 1 model = Dinov2ForSemanticSegmentation.from_pretrained("facebook/dinov2-base", id2label=id2label, num_labels=len(id2label))

1 frames
[<ipython-input-36-e8c7a63af851>](https://localhost:8080/#) in __init__(self, config)
     23     super().__init__(config)
     24 
---> 25     self.dinov2 = Dinov2Model(config, add_pooling_layer=False)
     26     self.classifier = LinearClassifier(config.hidden_size, 32, 32, config.num_labels)
     27 

TypeError: Dinov2Model.__init__() got an unexpected keyword argument 'add_pooling_layer'
  2. The rest of the code runs except for training: it does train for a while, but I guess it eventually encounters a data point it can't convert. It's tested on the same dataset as in your example, with no code modifications.

Error below: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-35-c5ede5803258>](https://localhost:8080/#) in <cell line: 19>()
     19 for epoch in range(epochs):
     20   print("Epoch:", epoch)
---> 21   for idx, batch in enumerate(tqdm(train_dataloader)):
     22       pixel_values = batch["pixel_values"].to(device)
     23       labels = batch["labels"].to(device)

8 frames
[/usr/local/lib/python3.10/dist-packages/albumentations/core/composition.py](https://localhost:8080/#) in _check_args(self, **kwargs)
    284 
    285         if self.is_check_shapes and shapes and shapes.count(shapes[0]) != len(shapes):
--> 286             raise ValueError(
    287                 "Height and Width of image, mask or masks should be equal. You can disable shapes check "
    288                 "by setting a parameter is_check_shapes=False of Compose class (do it only if you are sure "

ValueError: Height and Width of image, mask or masks should be equal. You can disable shapes check by setting a parameter is_check_shapes=False of Compose class (do it only if you are sure about your data consistency).

I could disable is_check_shapes, but I'm thinking this would keep some images from being converted the way the model requires.

rainbowpuffpuff avatar Aug 15 '23 13:08 rainbowpuffpuff

@rainbowpuffpuff thanks for reporting, I've removed add_pooling_layer recently, so no need to pass that. I've updated my notebook.

Regarding the second question: it looks like Albumentations is saying there's an image whose segmentation mask has a different shape. Weird, I haven't encountered that. I will rerun the notebook to verify.

NielsRogge avatar Aug 15 '23 13:08 NielsRogge

Hi @Starlento great question! This is because DINOv2 (and vision transformers in general) typically also add a special CLS token before the sequence of image patches. Hence the sequence length becomes (image_size/patch_size)**2 + 1. So in case you use a DINOv2 model with an image resolution of 224 and a patch size of 16, you get (224/16)**2 + 1 = 257 embeddings out.

Hence one usually discards the final embedding of the CLS token and only uses the embeddings of the image patches, as done here. I think it makes sense to add a Dinov2Backbone class to the Transformers library, in a similar spirit to other backbones. I've made a PR above for that.

Here's how you can use it (for now you'll need to do pip install git+https://github.com/nielsrogge/transformers.git@add_dinov2_backbone):

from transformers import Dinov2Backbone
import torch

model = Dinov2Backbone.from_pretrained("facebook/dinov2-base", out_indices=[0,1,2,3])

pixel_values = torch.randn(1, 3, 224, 224)

outputs = model(pixel_values)

for feature_map in outputs.feature_maps:
    print(feature_map.shape)

By default, feature maps will be 4D, i.e. of shape (batch_size, num_channels, height, width). If you want 3D feature maps, just pass reshape_hidden_states=False to the from_pretrained method.

I found that the reshape is somewhat wrong:

from transformers import Dinov2Backbone
import torch

encoder = Dinov2Backbone.from_pretrained("hf-base-models/facebook_dinov2-large", out_features=["stage6", "stage12", "stage18", "stage24"])
picked_hidden_states = encoder(torch.rand(1, 3, 448, 224)).feature_maps
for x in picked_hidden_states:
    print(x.shape)
which prints:

torch.Size([1, 1024, 16, 32])
torch.Size([1, 1024, 16, 32])
torch.Size([1, 1024, 16, 32])
torch.Size([1, 1024, 16, 32])

I used to use only square images, so I did not notice the problem... The problem might be here:

        for stage, hidden_state in zip(self.stage_names, hidden_states):
            if stage in self.out_features:
                if self.config.apply_layernorm:
                    hidden_state = self.layernorm(hidden_state)
                if self.config.reshape_hidden_states:
                    batch_size, _, height, width = pixel_values.shape
                    patch_size = self.config.patch_size
                    hidden_state = hidden_state[:, 1:, :].reshape(
                        batch_size, width // patch_size, height // patch_size, -1
                    )
                    hidden_state = hidden_state.permute(0, 3, 1, 2).contiguous()
                feature_maps += (hidden_state,)
  hidden_state = hidden_state[:, 1:, :].reshape(
      batch_size, width // patch_size, height // patch_size, -1
  )

Shouldn't height be placed in front of width?

Starlento avatar Oct 25 '23 06:10 Starlento

Hi @Starlento,

That's indeed a bug in the original implementation; I've addressed it in https://github.com/huggingface/transformers/pull/26092.

NielsRogge avatar Oct 25 '23 07:10 NielsRogge

Has anybody seen an example of using Mask2Former as a head?

GLASS-z13 avatar Oct 25 '23 22:10 GLASS-z13

@rainbowpuffpuff thanks for reporting, I've removed add_pooling_layer recently, so no need to pass that. I've updated my notebook.

regarding the second question, looks like Albumentations says there's an image which has a segmentation mask with a different shape, weird, haven't encountered that. Will rerun the notebook to verify

Looks like at least one of the images in the dataset is transposed. As a quick hack, in the SegmentationDataset class I added the following, and with it I could train:

if original_image.shape[:2] != original_segmentation_map.shape:
    original_image = np.transpose(original_image, (1, 0, 2))
    print("Transposed and continuing")
    print("Original image " + str(original_image.shape))

mfarre avatar Mar 14 '24 09:03 mfarre

Hi @NielsRogge, thanks for the detailed information. According to your explanation, doing:

outputs = model(**inputs)
outputs.last_hidden_state

should be exactly the same as doing:

outputs = model(**inputs, output_hidden_states=True)
outputs.hidden_states[-1]

In my case I find that these 2 chunks of code return completely different tensors, so I am not sure which one corresponds to the final CLS and patch embeddings at the end of the Transformer. If you could clarify it would be great.

Thanks!

alvaro-stylesage avatar Mar 26 '24 11:03 alvaro-stylesage

In my case I find that these 2 chunks of code return completely different tensors, so I am not sure which one corresponds to the final CLS and patch embeddings at the end of the Transformer. If you could clarify it would be great.

Yes, that's because there is a layernorm applied to the last hidden states, as seen here.
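
If you want to check this yourself, a quick sanity check could look like the snippet below (assuming the final norm is exposed as model.layernorm, as in the current modeling code):

import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[-1] is the last layer's output before the final layernorm;
# last_hidden_state is that same tensor after the final layernorm
print(torch.allclose(model.layernorm(outputs.hidden_states[-1]), outputs.last_hidden_state, atol=1e-5))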

NielsRogge avatar Mar 26 '24 11:03 NielsRogge

In my case I find that these 2 chunks of code return completely different tensors, so I am not sure which one corresponds to the final CLS and patch embeddings at the end of the Transformer. If you could clarify it would be great.

Yes, that's because there is a layernorm applied to the last hidden states, as seen here.

Thanks for the answer. So when using DinoV2 as a feature extractor for images, is it better to take the embeddings after applying LayerNorm or before?

Thanks!

alvaro-stylesage avatar Mar 26 '24 12:03 alvaro-stylesage

It's mostly a matter of experimentation; I would just try out both and see which one works best.

NielsRogge avatar Mar 28 '24 08:03 NielsRogge

@NielsRogge, just for the sake of clarity, to account for the second dimension of the output tensor being (image_size / patch_size)^2 + 1: from what I read and understood from the model card, the model's image patch size is 14 pixels, not 16 pixels as you mentioned here, so with an image resolution of 224 and a patch size of 14 you get (224 / 14)^2 + 1 = 257 embeddings out. Thank you so much for your work! ;)
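
A quick way to double-check this from the checkpoint itself (the attribute names below assume the current Dinov2Config):

from transformers import Dinov2Config

config = Dinov2Config.from_pretrained("facebook/dinov2-base")
print(config.patch_size)                    # 14
print((224 // config.patch_size) ** 2 + 1)  # (224 / 14)^2 + 1 = 257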

edouardmercier avatar Mar 28 '24 19:03 edouardmercier

Could you reproduce any of the paper's results?

franchesoni avatar May 04 '24 08:05 franchesoni

@franchesoni I ported the weights to the HF format; to reproduce the results, I'd recommend the scripts present in the original repository. We do have image classification scripts here: https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification, but they are mostly for demo purposes and need to be tweaked for a specific use case.

NielsRogge avatar May 04 '24 11:05 NielsRogge

@NielsRogge thank you for your amazing contribution! Could you please tell me more about the "rescale_factor" parameter: why do we need it, and why does it have to take the value 0.00392156862745098? I did not find the corresponding piece of code in the official DINOv2 repo; can you point it out?

zhaoyanpeng avatar May 15 '24 15:05 zhaoyanpeng

Hi @zhaoyanpeng that value comes from 1/255. Typically the red, green and blue color channels of images have values between 0 and 255. Neural networks on the other hand are typically trained on numbers between 0 and 1. So rescaling is a kind of standardization step. The original repository uses ToTensor from torchvision which does the same thing.
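
As a small illustration of that equivalence (the random image below is just a stand-in):

import numpy as np
import torch
from torchvision.transforms import ToTensor

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)  # fake HWC uint8 image

# torchvision's ToTensor converts to CHW floats and scales [0, 255] down to [0, 1]
via_totensor = ToTensor()(image)

# the image processor's rescale step does the same thing: multiply by rescale_factor = 1/255
via_rescale = torch.from_numpy(image).permute(2, 0, 1).float() / 255

print(torch.allclose(via_totensor, via_rescale))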

NielsRogge avatar May 15 '24 16:05 NielsRogge

Hi @zhaoyanpeng that value comes from 1/255. Typically the red, green and blue color channels of images have values between 0 and 255. Neural networks on the other hand are typically trained on numbers between 0 and 1. So rescaling is a kind of standardization step. The original repository uses ToTensor from torchvision which does the same thing.

Ah, now it makes sense. Thank you for your prompt reply!

zhaoyanpeng avatar May 16 '24 01:05 zhaoyanpeng

Hi, does anyone have an issue like this: a KeyError for 'dinov2'? This is with transformers==4.30.2 and timm==0.9.12, and everything downloaded from https://hf-mirror.com/facebook/dinov2-base/tree/main. (screenshot omitted)

Yonggie avatar Jul 01 '24 02:07 Yonggie

DINOv2 was probably added in a later version of Transformers, so pip install --upgrade transformers will fix that.

NielsRogge avatar Jul 01 '24 07:07 NielsRogge

@NielsRogge thank you for the amazing work! I have a question regarding image similarity calculation: should I take the mean value of the last_hidden_state for each of the 2 images to compute

emb_img1, emb_img2 = last_hidden_states[0].mean(dim=0), last_hidden_states[1].mean(dim=0)
metric = F.cosine_similarity(emb_img1, emb_img2, dim=0)

or
emb_img1, emb_img2 = last_hidden_states[0, 0], last_hidden_states[1, 0]  # Get cls token (0-th token) for each img

The second snippet successfully replicated the result of this paper: https://openaccess.thecvf.com/content/CVPR2023/supplemental/Ruiz_DreamBooth_Fine_Tuning_CVPR_2023_supplemental.pdf, but the first one failed to.

Sundragon1993 avatar Aug 08 '24 09:08 Sundragon1993

Hi,

It depends a bit: some models have a CLS token which is specifically trained in a contrastive way, like CLIP or SigLIP, so in that case it's advised to use it. Other models work better by average pooling the final hidden states of the patch tokens.

So I would try both approaches and see which one works best.
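
For concreteness, here is a minimal sketch of both options, using a random tensor as a stand-in for the real last_hidden_states of a two-image batch:

import torch
import torch.nn.functional as F

last_hidden_states = torch.randn(2, 257, 768)  # stand-in: (batch, num_patches + 1, hidden_size)

# option 1: use the CLS token (index 0) of each image
cls_1, cls_2 = last_hidden_states[0, 0], last_hidden_states[1, 0]
print(F.cosine_similarity(cls_1, cls_2, dim=0).item())

# option 2: mean-pool the patch tokens (dropping the CLS token) of each image
mean_1, mean_2 = last_hidden_states[0, 1:].mean(dim=0), last_hidden_states[1, 1:].mean(dim=0)
print(F.cosine_similarity(mean_1, mean_2, dim=0).item())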

NielsRogge avatar Aug 08 '24 13:08 NielsRogge