
How to Fine-Tune DINOv3 on Custom Dataset with Self-Supervised Learning Locally

Open MaybeRichard opened this issue 4 months ago • 26 comments

Hi team, I'm new to DINO and want to fine-tune DINOv3 using self-supervised learning on my custom dataset. Hardware: 8 NVIDIA 4090 GPUs. Goal: continue training from the pre-trained checkpoints and use the model for feature extraction and downstream tasks on medical images. Dataset details:

  • ~200k medical 2D images, mostly grayscale (plan to convert to RGB by duplicating channels).
  • Various sizes/resolutions, formats (JPEG/PNG).

Questions:

  • Recommended model size (e.g., ViT-B/16) for ~200k images?
  • Training command example starting from checkpoints in repo?

MaybeRichard avatar Aug 15 '25 13:08 MaybeRichard

Hello and thanks for the interest.

Before trying to fine-tune the backbone, can you try extracting the features from the frozen backbone and evaluating the downstream performance? Just in case it already works well.
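
Something along these lines should be enough for the frozen-feature check. This is a rough sketch: the torch.hub entrypoint name (dinov3_vitl16) and the weights argument follow the repo README but should be double-checked, the checkpoint path is a placeholder, and the preprocessing is just standard ImageNet-style normalization.

import torch
from PIL import Image
from torchvision import transforms

# Load the pretrained ViT-L backbone from a local checkpoint (assumed entrypoint/argument names).
model = torch.hub.load(
    "facebookresearch/dinov3",
    "dinov3_vitl16",
    weights="/path/to/dinov3_vitl16_pretrain_lvd1689m.pth",
)
model.eval().cuda()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = Image.open("example.png").convert("RGB")  # grayscale -> 3 identical channels
with torch.inference_mode():
    feats = model(preprocess(img).unsqueeze(0).cuda())  # (1, embed_dim) global feature
print(feats.shape)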

If you want to fine-tune the model you can follow the steps outlined in the Training section of the readme: https://github.com/facebookresearch/dinov3?tab=readme-ov-file#training. A few notes:

  • Start from the ViT-L checkpoint and make sure it's loaded before starting to train.
  • Use the ViT-L configuration vitl_im1k_lin834.yaml.
  • Adapt the resolution to something that makes sense for your data and tasks. E.g. if it's segmentation you should start training at 512 and possibly increase to 768.
  • Original img format (jpg/png) doesn't matter, as long as PIL can load them.
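
On the last two points, a minimal sketch (the helper name is just illustrative) of loading mixed JPEG/PNG grayscale images as 3-channel RGB with PIL:

from pathlib import Path
from PIL import Image

def iter_rgb_images(root: str):
    # PIL handles both JPEG and PNG; .convert("RGB") replicates a single grayscale channel.
    for path in Path(root).rglob("*"):
        if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            yield Image.open(path).convert("RGB")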

baldassarreFe avatar Aug 15 '25 14:08 baldassarreFe

Hi! I have a similar question.

I’d like to continue pretraining from one of the shared checkpoints (e.g., dinov3_vits16_pretrain_lvd1689m-*.pth).

  1. Which argument should I use to resume? a) Should I use student.resume_from_teacher_chkpt for this, or is that flag only for the high-res adaptation stage? b) If it’s only for High-resolution adaptation, then to continue pretraining I assume I should resume from a full training checkpoint (with optimizer/EMA). However, the public files seem to be backbone-only. c) When I try resume_from_teacher_chkpt with the public file, loading looks like it expects a top-level "teacher" key (e.g., ckpt["teacher"] in init_fsdp_model_from_checkpoint), which those public weights don’t have.

  2. What’s the intended usage of student.pretrained_weights? a) Should this be used to initialize a new pretraining run from the public backbone? b) Is student.pretrained_weights considered deprecated?

  3. Recommended strategy for domain adaptation via continued training (no optimizer state available, etc) Since the public files don’t include optimizer state, last LR, last iter, etc., what config tweaks do you recommend to safely continue training on a new domain (medical images)? For example: start with a lower LR (how much lower than your default?) etc.

Thanks!

halecakir avatar Aug 15 '25 15:08 halecakir

Thanks a lot! By the way, could you please clarify what range of loss values is typically expected for DINOv3 training? In other words, around what loss level would you consider the model to be training well, assuming the dataset and setup are reasonable?

MaybeRichard avatar Aug 16 '25 07:08 MaybeRichard

@baldassarreFe thank you so much for allowing SSL fine-tuning this time! I can't wait to try it.

alexisdrakopoulos avatar Aug 16 '25 10:08 alexisdrakopoulos

@MaybeRichard have you managed to get this up and running? I am looking to SSL fine-tune the base model on 3 million images soon, but I'm very confused about how to start from the checkpoint and which configs I need.

alexisdrakopoulos avatar Aug 16 '25 17:08 alexisdrakopoulos

@alexisdrakopoulos Yes, but there are still some bugs I'm trying to fix. Maybe we can discuss further how to make this work (DM me by email or another channel).

MaybeRichard avatar Aug 16 '25 23:08 MaybeRichard

@MaybeRichard I've emailed you. I won't be able to test this until later this week, but I definitely want to try fine-tuning with SSL on my out-of-domain dataset.

I still need to do some work setting up benchmarks though!

alexisdrakopoulos avatar Aug 17 '25 06:08 alexisdrakopoulos

In my experience, training from scratch with domain-specific dataset performed better than fine-tuning the (general) pretrained model.

sehunfromdaegu avatar Aug 19 '25 07:08 sehunfromdaegu

@sehunfromdaegu but in this case the training procedure is quite slow/difficult due to the 7B model size. You can't just train the distilled models from scratch on a small dataset, right? Then you might as well use DINOv2? Or am I wrong?

alexisdrakopoulos avatar Aug 19 '25 07:08 alexisdrakopoulos

@sehunfromdaegu but in this case the training procedure is quite slow/difficult due to the 7B model size. You can't just train the distilled models from scratch on a small dataset, right? Then you might as well use DINOv2? Or am I wrong?

You are right; it is quite slow and difficult to train a large model. We typically don't have very large domain-specific datasets or enough GPU power.

My experience is that pretraining a smaller model like ViT-B or ViT-L on domain-specific data outperforms a larger, general-purpose pretrained model like ViT-H. Honestly, I haven't tried fine-tuning ViT-G or ViT-7B, but I would guess that a smaller, domain-specific model would still work better. But you have to choose the pretraining method carefully, as a SOTA pretraining approach (like DINOv2/DINOv3) may not work best for your specific domain.

sehunfromdaegu avatar Aug 19 '25 07:08 sehunfromdaegu

@sehunfromdaegu I have up to 6 or 7 million image / text pairs.

I tried fine-tuning SigLIP2 with 3 million but I had a lot of trouble with the performance stalling after 1 or 2 epochs, even with extremely low LRs.

Would you recommend trying to train a model from scratch? I only have access to 1xH200 for now and only for up to around 1 day which is frustrating.

alexisdrakopoulos avatar Aug 19 '25 07:08 alexisdrakopoulos

@sehunfromdaegu sorry just to elaborate.

When you talk about training a model from scratch, what procedure are you using? DINOv2? DINOv3? CLIP? SigLIP? There are so many ways to train these ViT backbones that I can't decide which will work best.

The data augmentation procedures from dinov2 are perfect for my use cases. Maybe I need to look into its training procedure along with CLIP models.

alexisdrakopoulos avatar Aug 19 '25 07:08 alexisdrakopoulos

@sehunfromdaegu sorry just to elaborate.

When you talk about training a model from scratch, what procedure are you using? DINOv2? DINOv3? CLIP? SigLIP? There are so many ways to train these ViT backbones that I can't decide which will work best.

The data augmentation procedures from dinov2 are perfect for my use cases. Maybe I need to look into its training procedure along with CLIP models.

A 1-day limit can be quite frustrating; perhaps implementing a way to resume from the latest checkpoint would be helpful.

I haven't actually tried pretraining with the DINOv2 framework. I believe its impressive performance comes partially from its specific data curation (and the size as well), which may not be suitable for my dataset.

Surprisingly, I found that a less popular pretraining method (VICRegL) worked better than other fancier approaches (e.g., Hiera) in a specific domain (but the number of pretraining samples was only about 1M). While hyperparameter choices may change the result, I now believe that success really depends on your dataset. Still, I suppose luck plays a pretty significant role as well.

sehunfromdaegu avatar Aug 19 '25 07:08 sehunfromdaegu

@sehunfromdaegu I'm trying to keep training cheap / lightweight for my use cases.

Maybe I should just try training DINOv2 or v3 base models from scratch, but it's very difficult. I saw DINOv2 was trained successfully on sub-1M datasets of X-ray images.

It's annoying because I have good, high-quality text labels, but I have had a terrible experience trying to fine-tune SigLIP2. Maybe I need to try training from scratch.

alexisdrakopoulos avatar Aug 19 '25 08:08 alexisdrakopoulos

Hello, how are you? I have never trained a custom model with DINO; I have only worked with YOLO and Roboflow. Can anyone guide me on using it from scratch with a classification model? I need to classify 8 types of yerba. Is there any example notebook out there? Thank you!!

coki0291 avatar Aug 21 '25 11:08 coki0291

Let me describe a few approaches I've tried. I initially used DINOv1 for pre-training, then fine-tuned all parameters. When the data size reached about 10 million images, the following patterns emerged, and they persisted even when the data size reached about 100 million.

  1. When the teacher network reached its best (SOTA) performance at around 60%-70% of the total epochs, fine-tuning yielded the best results. Freezing the backbone for downstream tasks also yielded good results.
  2. When the teacher network reached its best performance at around 80% of the epochs, fine-tuning yielded diminished results, but freezing the backbone for downstream tasks yielded even better results.
  3. In both cases, the teacher network's performance degraded significantly by 100% of the epochs.

I've discovered two significant issues with DINOv1. One is its requirement for large-scale data redundancy; convergence is difficult with dirty data. The other is training instability, especially when training in FP16; this is also documented in the v1 code. I believe v2 addresses these two issues better, and DINOv3 addresses a third: preventing mode collapse in the later stages of SSL. If you want to fine-tune all model parameters, you can try adding DINOv2's koleo_loss and DINOLoss to v1. This will make training more stable and offer some improvement. However, the most important thing is to have your own data cleaning pipeline.
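
For reference, a rough reimplementation sketch of a KoLeo-style regularizer in the spirit of DINOv2's KoLeoLoss (not the repo's own implementation), which could be added on top of a v1-style objective:

import torch
import torch.nn.functional as F

def koleo_loss(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Kozachenko-Leonenko style regularizer: push each sample away from its nearest
    # neighbor in the batch so the features spread out instead of collapsing.
    x = F.normalize(features, dim=-1, p=2)
    sim = x @ x.t()                          # cosine similarities, shape (B, B)
    sim.fill_diagonal_(-2.0)                 # exclude self-similarity
    nn_idx = sim.argmax(dim=1)               # nearest neighbor of each sample
    nn_dist = (x - x[nn_idx]).norm(dim=1)    # distance to that neighbor
    return -torch.log(nn_dist + eps).mean()  # minimizing pushes neighbors apart

# Example: add it to the main objective with a small weight, e.g. 0.1
loss = koleo_loss(torch.randn(8, 256))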

HollrayChan avatar Aug 27 '25 08:08 HollrayChan

I got nan during training, I think it is because I loaded the model as float16?

weathon avatar Aug 29 '25 09:08 weathon

I got nan during training, I think it is because I loaded the model as float16?

In my experience, it is very likely that fp16 + self-attention causes the NaN loss. The NaN loss often disappears when you just use fp32.
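
A minimal sketch of what I mean (the hub entrypoint and weights argument are assumptions, adjust to however you load the model): keep the weights in fp32 and run the forward pass under bf16 autocast instead of casting the whole model to fp16.

import torch

# Keep the weights in fp32 (no .half()) and autocast the forward pass to bf16, which has
# the same exponent range as fp32, so the fp16 overflow-to-NaN in attention is avoided.
model = torch.hub.load(
    "facebookresearch/dinov3",
    "dinov3_vits16",
    weights="/path/to/dinov3_vits16_pretrain_lvd1689m.pth",
)
model.cuda().float().eval()

x = torch.randn(2, 3, 224, 224, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
print(out.dtype, out.shape)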

sehunfromdaegu avatar Aug 29 '25 12:08 sehunfromdaegu

Hi! I have a similar question.

I’d like to continue pretraining from one of the shared checkpoints (e.g., dinov3_vits16_pretrain_lvd1689m-*.pth).

  1. Which argument should I use to resume? a) Should I use student.resume_from_teacher_chkpt for this, or is that flag only for the high-res adaptation stage? b) If it’s only for High-resolution adaptation, then to continue pretraining I assume I should resume from a full training checkpoint (with optimizer/EMA). However, the public files seem to be backbone-only. c) When I try resume_from_teacher_chkpt with the public file, loading looks like it expects a top-level "teacher" key (e.g., ckpt["teacher"] in init_fsdp_model_from_checkpoint), which those public weights don’t have.
  2. What’s the intended usage of student.pretrained_weights? a) Should this be used to initialize a new pretraining run from the public backbone? b) Is student.pretrained_weights considered deprecated?
  3. Recommended strategy for domain adaptation via continued training (no optimizer state available, etc) Since the public files don’t include optimizer state, last LR, last iter, etc., what config tweaks do you recommend to safely continue training on a new domain (medical images)? For example: start with a lower LR (how much lower than your default?) etc.

Thanks!

Hi, I have the exact same question. I am also trying to load a pretrained weight and continue training, but it's not working. Thank you for asking this!

qqplot avatar Sep 02 '25 17:09 qqplot

I got nan during training, I think it is because I loaded the model as float16?

I found that when training ViT-B, if I set qkv_bias to true, NaN values are prone to appear. When qkv_bias is set to false, training becomes stable. By examining the official ViT-B weights, I noticed that both the qkv.bias and qkv.bias_mask weights are all zeros. Therefore, it is best to change qkv_bias to false in ssl_default_config.yaml.
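
A quick way to check this observation yourself (checkpoint path and state-dict key names are assumptions; adjust to your local file):

import torch

state_dict = torch.load("/path/to/dinov3_vitb16_pretrain_lvd1689m.pth", map_location="cpu")
bias_keys = [k for k in state_dict if k.endswith("qkv.bias") or k.endswith("qkv.bias_mask")]
# Expect every entry to be all zeros, per the observation above.
print({k: torch.count_nonzero(state_dict[k]).item() for k in bias_keys})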

1921134176 avatar Sep 03 '25 07:09 1921134176

Hi! I have a similar question.

I’d like to continue pretraining from one of the shared checkpoints (e.g., dinov3_vits16_pretrain_lvd1689m-*.pth).

  1. Which argument should I use to resume? a) Should I use student.resume_from_teacher_chkpt for this, or is that flag only for the high-res adaptation stage? b) If it’s only for High-resolution adaptation, then to continue pretraining I assume I should resume from a full training checkpoint (with optimizer/EMA). However, the public files seem to be backbone-only. c) When I try resume_from_teacher_chkpt with the public file, loading looks like it expects a top-level "teacher" key (e.g., ckpt["teacher"] in init_fsdp_model_from_checkpoint), which those public weights don’t have.
  2. What’s the intended usage of student.pretrained_weights? a) Should this be used to initialize a new pretraining run from the public backbone? b) Is student.pretrained_weights considered deprecated?
  3. Recommended strategy for domain adaptation via continued training (no optimizer state available, etc) Since the public files don’t include optimizer state, last LR, last iter, etc., what config tweaks do you recommend to safely continue training on a new domain (medical images)? For example: start with a lower LR (how much lower than your default?) etc.

Thanks!

This is a nice observation. I am dealing with the same problem; did you manage to figure it out? Thanks

I did something like:

def build_model_from_cfg(cfg, only_teacher: bool = False):
    outputs = build_model(
        cfg.student,
        only_teacher=only_teacher,
        img_size=cfg.crops.global_crops_size
        if isinstance(cfg.crops.global_crops_size, int)
        else max(cfg.crops.global_crops_size),
        device="meta",
    )
    if only_teacher:
        teacher, embed_dim = outputs
        # Load pretrained weights for teacher if specified
        if hasattr(cfg.student, 'pretrained_weights') and cfg.student.pretrained_weights:
            logger.info(f"Loading backbone pretrained weights for teacher from {cfg.student.pretrained_weights}")
            import torch
            state_dict = torch.load(cfg.student.pretrained_weights, map_location="cpu")
            # Remove any prefixes that might exist
            state_dict = {k.replace("module.", ""): v for k, v in state_dict.items()}
            state_dict = {k.replace("backbone.", ""): v for k, v in state_dict.items()}
            missing_keys, unexpected_keys = teacher.load_state_dict(state_dict, strict=True, assign=True)
            logger.info(f"Loaded teacher pretrained weights with missing_keys: {len(missing_keys)}, unexpected_keys: {len(unexpected_keys)}")
        return teacher, embed_dim
    else:
        student, teacher, embed_dim = outputs
        # Load pretrained weights for student if specified
        if hasattr(cfg.student, 'pretrained_weights') and cfg.student.pretrained_weights:
            logger.info(f"Loading backbone pretrained weights for student from {cfg.student.pretrained_weights}")
            import torch
            state_dict = torch.load(cfg.student.pretrained_weights, map_location="cpu")
            # Remove any prefixes that might exist
            state_dict = {k.replace("module.", ""): v for k, v in state_dict.items()}
            state_dict = {k.replace("backbone.", ""): v for k, v in state_dict.items()}
            missing_keys, unexpected_keys = student.load_state_dict(state_dict, strict=True, assign=True)
            logger.info(f"Loaded student pretrained weights with missing_keys: {len(missing_keys)}, unexpected_keys: {len(unexpected_keys)}")
        return student, teacher, embed_dim

in dinov3/dinov3/models/__init__.py - a slight modification. I believe that loading only the backbone is the correct approach, or maybe I am wrong? I also noticed that the loss converges, but the quality of the embeddings degrades when training on a medical imaging dataset.
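
Regarding the loss converging while embedding quality degrades: a lightweight way to track this during continued pretraining is a periodic k-NN probe on a small labeled subset, e.g. something like this sketch (my own helper, not the repo's evaluation code):

import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=20):
    # Cosine-similarity k-NN classification on frozen features: if this drops while the
    # SSL loss keeps decreasing, the embeddings are degrading.
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.t()
    nn_idx = sims.topk(k, dim=1).indices  # k nearest training samples per query
    votes = train_labels[nn_idx]          # their labels, shape (N_test, k)
    preds = votes.mode(dim=1).values      # majority vote
    return (preds == test_labels).float().mean().item()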

Here is my config for vit_small continued training:

MODEL:
  META_ARCHITECTURE: SSLMetaArch
  DEVICE: cuda
  WEIGHTS: ''
  DTYPE: float32

compute_precision:
  param_dtype: bf16
  reduce_dtype: fp32
  sharding_strategy: SHARD_GRAD_OP

dino:
  loss_weight: 1.0
  global_ignore_diagonal: true
  head_n_prototypes: 65536  # Adjust based on your dataset size and memory (8192)
  head_bottleneck_dim: 256
  head_norm_last_layer: false
  head_nlayers: 3
  head_hidden_dim: 2048
  koleo_loss_weight: 0.1
  koleo_loss_distributed: false
  koleo_topk: 1
  koleo_distributed_replicas: 0
  force_weight_norm: false

ibot:
  loss_weight: 1.0
  mask_sample_probability: 0.5
  mask_ratio_min_max:
  - 0.1
  - 0.5
  mask_random_circular_shift: false
  force_masking_even_with_zero_weight: false
  separate_head: true
  head_n_prototypes: 65536  # Should match dino.head_n_prototypes (8192)
  head_bottleneck_dim: 256
  head_norm_last_layer: false
  head_nlayers: 3
  head_hidden_dim: 2048

train:
  batch_size_per_gpu: 32  # Adjust based on your GPU memory
  dataset_path: CustomTIFF:root=../Datasets/composite/
  output_dir: ./output_custom_dinov3_flexible
  saveckp_freq: 20
  seed: 42
  num_workers: 32  # Adjust based on your CPU
  OFFICIAL_EPOCH_LENGTH: 1250  # Adjust based on dataset size (400)
  monitor_gradient_norm: false
  chunk_schedule: []
  cache_dataset: false
  use_teacher_head: true
  learn_from_teacher_tokens: false
  centering: sinkhorn_knopp
  checkpointing: false
  compile: true  # Disable for stability on some systems
  cudagraphs: false

student:
  arch: vit_small  # Can be: vit_small, vit_base, vit_large, vit_huge, etc.
  patch_size: 16
  drop_path_rate: 0.3  # Reduced dropout for smaller datasets
  layerscale: 1.0e-05
  pretrained_weights: 'dinov3/dinov3/checkpoints/dinov3_vits16_pretrain_lvd1689m-08c60483.pth'
  ffn_layer: mlp
  ffn_ratio: 4.0
  resume_from_teacher_chkpt: ''
  qkv_bias: true
  proj_bias: true
  ffn_bias: true
  norm_layer: layernorm
  n_storage_tokens: 4
  mask_k_bias: true
  in_chans: 3
  pos_embed_type: rope
  pos_embed_rope_base: 100.0
  pos_embed_rope_dtype: bf16
  fp8_enabled: False

teacher:
  momentum_teacher: 0.996  # Can adjust based on training dynamics
  final_momentum_teacher: 1
  warmup_teacher_temp: 0.04
  teacher_temp: 0.07
  warmup_teacher_temp_epochs: 10  # Adjust based on total epochs
  in_chans: 3

optim:
  epochs: 100  # Adjust based on your needs
  optimizer: adamw
  weight_decay: 0.04
  weight_decay_end: 0.2
  # lr: 0.0005  # Adjust based on batch size and dataset
  lr: 0.001  # Adjust based on batch size and dataset
  warmup_epochs: 5  # Adjust based on total epochs
  min_lr: 1.0e-06
  clip_grad: 3.0
  freeze_last_layer_epochs: 1
  scaling_rule: sqrt_wrt_1024
  patch_embed_lr_mult: 0.2
  dino_head_wd_multiplier: 1.0
  layerwise_decay: 0.9
  multi_tensor_optim: true
  adamw_beta1: 0.9
  adamw_beta2: 0.999

crops:
  global_crops_scale:
  - 0.4
  - 1.0
  local_crops_number: 8
  local_crops_scale:
  - 0.05
  - 0.4
  global_crops_size: 256
  local_crops_size: 112
  global_local_crop_pairs_ratios: 1.0
  localcrops_subset_of_globalcrops: false
  share_color_jitter: false
  gram_teacher_crops_size: null
  horizontal_flips: true
  gram_teacher_no_distortions: false  # If True, no distortions are applied to gram teacher crops
  # Standard ImageNet normalization - adjust if needed for your domain
  rgb_mean:
  - 0.485
  - 0.456
  - 0.406
  rgb_std:
  - 0.229
  - 0.224
  - 0.225

evaluation:
  eval_period_iterations: 500  # Frequency of evaluation
  low_freq_every: 2

# Checkpoint management - keeps last N checkpoints to save disk space
checkpointing:
  period: 3750  # Save checkpoint every N iterations
  max_to_keep: 3  # Keep last 3 training checkpoints (saves disk space)
  max_eval_to_keep: 5  # Keep last 5 eval checkpoints (NEW!) - set to null to keep all
  keep_every: 99999999999999999  # Additionally save a permanent checkpoint every N iterations

marjanstoimchev avatar Sep 03 '25 08:09 marjanstoimchev

Hey, trying on my side as well to train (from scratch, without fine-tuning) a ViT-B/14 on 128 V100 GPUs with a total batch size of 2048; from experience we know that domain-specific encoders are >> those trained on natural images (might be different with DINOv3 though 👍).

In addition to the patch size, the only modified parameters are the following:

  train.batch_size_per_gpu=16 \
  train.OFFICIAL_EPOCH_LENGTH=1075 \
  evaluation.eval_period_iterations=5000 \
  optim.lr=0.0005 \
  optim.scaling_rule=sqrt_wrt_2048 \
  optim.drop_path_rate=0.1 \
  optim.epochs=142 \
  optim.warmup_epochs=40 \
  teacher.warmup_teacher_temp_epochs=40 \

Training breaks down at iteration 6,000 with the following logs:

W20250822 10:20:07 3743140 dinov3 train.py:528] All-reduced metrics:
local_batch_size: 16.0
global_batch_size: 2048.0
dino_local_crops_loss: 10.209039688110352
dino_local_loss_weight: 1.0
dino_global_crops_loss: 10.042501449584961
koleo_loss: -0.27935218811035156
ibot_loss: nan
backbone_grad_norm: nan
dino_head_grad_norm: 23.661849975585938
ibot_head_grad_norm: nan

It seems that the culprit here is the iBOT objective (across different runs with decreased optim.lr, the NaN just appears later as the learning rate is lowered, but it still appears). I will investigate in more detail. Does that ring a bell, @baldassarreFe?
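
In case it helps while investigating, a small hedged sketch (not part of the repo's trainer) of a guard around the optimizer step, so a single non-finite iBOT batch doesn't corrupt the weights:

import math
import torch

def step_if_finite(optimizer, model, clip_grad=3.0):
    # clip_grad_norm_ returns the pre-clipping total gradient norm; if it is inf/NaN, drop the step.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)
    if not math.isfinite(total_norm.item()):
        optimizer.zero_grad(set_to_none=True)
        return False  # skipped: gradients were non-finite (e.g. from the iBOT loss)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True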

[Image: per-metric training curves plotted over iterations]

To visualize those results:

import json

import matplotlib.pyplot as plt
import pandas as pd

# Each line of training_metrics.json is a JSON dict logged at one iteration.
with open("training_metrics.json", "r") as f:
    metrics = pd.DataFrame([json.loads(line) for line in f if line.strip()])

metric_names = metrics.columns[3:]  # skip the bookkeeping columns

fig, axes = plt.subplots(5, 3, figsize=(15, 20))
for i, metric_name in enumerate(metric_names):
    ax = axes[i % 5, i // 5]
    ax.plot(metrics.iteration, metrics[metric_name])
    ax.set_title(metric_name)
    ax.set_xlabel("Iterations")
plt.subplots_adjust(hspace=0.5)
plt.show()

afilt avatar Sep 12 '25 10:09 afilt

Hello. In Section 5.2 “Model Distillation,” the document clearly states that the smaller models (ViT-Small, ViT-Base, ViT-Large) do not use Gram loss (i.e., the Gram-anchoring technique) during distillation. Specifically, in vitl_im1k_lin834.yaml Gram loss appears to be disabled (cfg.gram.use_loss is set to false). I’m therefore wondering: if I continue SSL pre-training from the dinov3-vitl16-pretrain-lvd1689m checkpoint, should Gram loss be turned on?

Asunatan avatar Sep 22 '25 09:09 Asunatan

I got nan during training, I think it is because I loaded the model as float16?

I found that when training vitb, if I set qkv_bias to true, nan values are prone to appear. When qkv_bias is set to false, the training becomes stable. By examining the official vitb weights, I noticed that both qkv.bias and qkv.bias_mask weights are all 0. Therefore, it is best to change qkv_bias to false in ssl_default_config.yaml.

Thank you for your answer about float16 training. I would like to ask whether using float16 in your experiments leads to performance degradation, as mentioned in https://github.com/facebookresearch/dinov3/issues/181.

Heart-eartH avatar Nov 03 '25 11:11 Heart-eartH

how can I train on an unlabelled image folder?

[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "E:\models\update_dinov3\dinov3\dinov3\train\train.py", line 661, in <module>
[rank0]:     main()
[rank0]:   File "E:\models\update_dinov3\dinov3\dinov3\train\train.py", line 657, in main
[rank0]:     do_train(cfg, model, resume=not args.no_resume)
[rank0]:   File "E:\models\update_dinov3\dinov3\dinov3\train\train.py", line 426, in do_train
[rank0]:     data_loader = build_multi_resolution_data_loader_from_cfg(
[rank0]:   File "E:\models\update_dinov3\dinov3\dinov3\train\train.py", line 369, in build_multi_resolution_data_loader_from_cfg
[rank0]:     loaders.append(build_data_loader_from_cfg(cfg=cfg_i, model=model, start_iter=start_iter))
[rank0]:   File "E:\models\update_dinov3\dinov3\dinov3\train\train.py", line 311, in build_data_loader_from_cfg
[rank0]:     dataset = make_dataset(
[rank0]:   File "E:\models\update_dinov3\dinov3\dinov3\data\loaders.py", line 102, in make_dataset
[rank0]:     class_, kwargs = _parse_dataset_str(dataset_str)
[rank0]:   File "E:\models\update_dinov3\dinov3\dinov3\data\loaders.py", line 53, in _parse_dataset_str
[rank0]:     key, value = token.split("=")
[rank0]: ValueError: not enough values to unpack (expected 2, got 1)
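
For what it's worth, the traceback suggests that train.dataset_path is split on ":" and each token on "=", so the spec must look like "<DatasetClass>:key=value:key=value" (the class name below is a placeholder). A Windows-style root such as E:\data would add an extra ":" and could produce exactly this ValueError; this is an illustration of the parsing, not the repo's code.

def parse_dataset_str(dataset_str: str):
    # Illustration only: "<DatasetClass>:<key>=<value>:<key>=<value>"
    name, *tokens = dataset_str.split(":")
    kwargs = {}
    for token in tokens:
        key, value = token.split("=")  # raises the ValueError above if a token has no "="
        kwargs[key] = value
    return name, kwargs

print(parse_dataset_str("MyDataset:root=/data/unlabelled_images"))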

linglizhang-seu avatar Dec 01 '25 13:12 linglizhang-seu

Hello. In Section 5.2 “Model Distillation,” the document clearly states that the smaller models (ViT-Small, ViT-Base, ViT-Large) do not use Gram loss (i.e., the Gram-anchoring technique) during distillation. Specifically, in vitl_im1k_lin834.yaml Gram loss appears to be disabled (cfg.gram.use_loss is set to false). I’m therefore wondering: if I continue SSL pre-training from the dinov3-vitl16-pretrain-lvd1689m checkpoint, should Gram loss be turned on?

Do you know what to do?

wz1217694175 avatar Dec 13 '25 05:12 wz1217694175