
DINOv2 is now available in HF Transformers (with tutorial)

NielsRogge opened this issue 11 months ago · 18 comments

Hi folks,

As there are multiple issues here about fine-tuning DINOv2 on custom data, semantic segmentation/depth estimation, image similarity, feature extraction, etc., these tasks should now become easier given that the model is available in HF Transformers. Check below for tips and tricks.

Documentation: https://huggingface.co/docs/transformers/main/model_doc/dinov2

The checkpoints are on the hub: https://huggingface.co/models?other=dinov2

I've created a tutorial notebook on training a linear classifier using DINOv2's frozen features for semantic segmentation. The notebook would be very similar for image classification or depth estimation.

Semantic segmentation/image classification/depth estimation

Refer to my demo notebook here: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DINOv2. One just places a linear classifier on top of the model, and uses the features as-is.
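As a rough sketch of the idea (not the exact notebook code; the head class, number of labels and input size below are placeholders), one freezes the backbone and trains only a linear layer over the patch features:

import torch
import torch.nn as nn
from transformers import AutoModel

# Hypothetical linear head on top of frozen DINOv2 patch features (name and hyperparameters made up for illustration)
class LinearSegmentationHead(nn.Module):
    def __init__(self, embedding_dim, num_labels):
        super().__init__()
        self.classifier = nn.Conv2d(embedding_dim, num_labels, kernel_size=1)

    def forward(self, patch_features, height, width):
        # patch_features: (batch_size, num_patches, embedding_dim)
        batch_size, num_patches, dim = patch_features.shape
        h = w = int(num_patches ** 0.5)
        # reshape the tokens back into a 2D feature map: (batch_size, embedding_dim, h, w)
        feature_map = patch_features.permute(0, 2, 1).reshape(batch_size, dim, h, w)
        logits = self.classifier(feature_map)
        # upsample the logits to the original image resolution
        return nn.functional.interpolate(logits, size=(height, width), mode="bilinear", align_corners=False)

backbone = AutoModel.from_pretrained("facebook/dinov2-base")
for param in backbone.parameters():
    param.requires_grad = False  # keep the backbone frozen; only the head is trained

head = LinearSegmentationHead(embedding_dim=backbone.config.hidden_size, num_labels=21)  # 21 classes as a placeholder

pixel_values = torch.randn(1, 3, 224, 224)  # dummy batch of images
with torch.no_grad():
    outputs = backbone(pixel_values=pixel_values)
patch_features = outputs.last_hidden_state[:, 1:, :]  # drop the CLS token, keep the patch tokens
logits = head(patch_features, height=224, width=224)  # (1, num_labels, 224, 224)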

Depth estimation:

  • DPT + DINOv2 is now supported; a notebook has been made available here (see also the inference sketch below).
  • For fine-tuning, one would however use a different loss function, like the one used in the GLPN model, to compute the loss between the predicted logits and the ground-truth depth maps.
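A minimal inference sketch with a DPT head on a DINOv2 backbone could look as follows; the checkpoint name below is an assumption for illustration, check the hub for the released DPT + DINOv2 identifiers:

import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, DPTForDepthEstimation

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# checkpoint name is an assumption; see the hub for the actual DPT + DINOv2 checkpoints
checkpoint = "facebook/dpt-dinov2-base-nyu"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = DPTForDepthEstimation.from_pretrained(checkpoint)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predicted_depth = outputs.predicted_depth  # (batch_size, height, width)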

Image classification: here the setup can be even simpler; you can just use Dinov2ForImageClassification. Refer to this notebook or the example scripts.
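For example, a minimal setup could look like this (the number of labels and the dummy label are placeholders for your own dataset):

import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, Dinov2ForImageClassification

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
# num_labels is a placeholder; set it to the number of classes in your dataset
model = Dinov2ForImageClassification.from_pretrained("facebook/dinov2-base", num_labels=10)

# optionally freeze the backbone so only the linear classifier is trained
for param in model.dinov2.parameters():
    param.requires_grad = False

inputs = processor(images=image, return_tensors="pt")
labels = torch.tensor([3])  # dummy label for illustration
outputs = model(**inputs, labels=labels)
loss, logits = outputs.loss, outputs.logits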

Feature extraction

Feature extraction is also very simple:

from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base')

# preprocess the image (resize + normalize) and run it through the model
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
features = outputs.last_hidden_state  # (batch_size, sequence_length, embedding_dim)

The features in this case will be a PyTorch tensor of shape (batch_size, sequence_length, embedding_dim), where the sequence consists of a CLS token followed by the image patch tokens. One can turn them into a single vector per image by averaging over the token dimension, like so:

features = features.mean(dim=1)

Now you have a single 768-dim vector (or another size, depending on which checkpoint you are using) for each image in your batch.
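Alternatively, one can use the CLS token as the image-level embedding instead of mean pooling; a quick sketch, reusing the outputs from the snippet above:

# image-level embedding from the CLS token (the first token in the sequence)
cls_embedding = outputs.last_hidden_state[:, 0]  # (batch_size, embedding_dim)
# Dinov2Model also exposes a pooled output, which is the CLS token after the final layernorm
pooled_embedding = outputs.pooler_output         # (batch_size, embedding_dim)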

Getting intermediate features

Intermediate features can be easily obtained by passing output_hidden_states=True to the forward method in the code snippet above. The outputs will then contain an additional key called hidden_states, which holds the features after the embedding layer and after each Transformer layer.
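For instance, reusing the model and inputs from the feature extraction snippet above:

outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one tensor per Transformer layer,
# each of shape (batch_size, sequence_length, embedding_dim)
hidden_states = outputs.hidden_states
print(len(hidden_states))        # number of layers + 1 (13 for dinov2-base)
intermediate = hidden_states[6]  # e.g. features after the 6th layer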

Image similarity

We have a tutorial on that here: https://huggingface.co/blog/image-similarity. Given that DINOv2 is now available in HF Transformers, one can simply replace the model_ckpt in the blog post with one of the DINOv2 checkpoints on the 🤗 hub.
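As a minimal sketch of the idea (mean-pooled embeddings compared with cosine similarity; the second image URL is just an example):

import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

def embed(url):
    # download an image and return a single mean-pooled DINOv2 embedding
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)  # (1, embedding_dim)

emb1 = embed("http://images.cocodataset.org/val2017/000000039769.jpg")
emb2 = embed("http://images.cocodataset.org/val2017/000000000139.jpg")

similarity = torch.nn.functional.cosine_similarity(emb1, emb2)
print(similarity.item())  # closer to 1.0 means more similar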

Can be relevant for #6, #14, #15, #25, #46, #47, #54, #55, #80, #84, #97, #99

Have fun fine-tuning them!

Cheers,

Niels

NielsRogge · Aug 03 '23 09:08