DINOv2 performance vs DINO
Hi,
I am working on a custom medical-imaging binary classification task and comparing DINO (ViT-B) against DINOv2 (ViT-L), training both models and then evaluating them with the eval_linear code. With DINO (ViT-B) trained for 100 epochs I get 85% classification accuracy on the test set, whereas with DINOv2 trained for 300 epochs (evaluated from the training_362499/teacher_checkpoint model) I get only 50% accuracy on the same test set.
I know it is hard to tell the reason from this alone, but are there any ideas I could try, or something I might be doing wrong? Any suggestions would be appreciated.
Thanks, Rohan
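For reference, here is a minimal sketch of the kind of linear-probe comparison I am describing. It uses the published torch.hub checkpoints rather than my self-trained weights, and load_patches() is a hypothetical stand-in for the actual data pipeline:

```python
import torch
from sklearn.linear_model import LogisticRegression

# Hypothetical helper: returns image tensors of shape (N, 3, 224, 224)
# and binary labels as numpy arrays; replace with the real data pipeline.
x_train, y_train, x_test, y_test = load_patches()

backbones = {
    "dino_vitb16": torch.hub.load("facebookresearch/dino:main", "dino_vitb16"),
    "dinov2_vitl14": torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14"),
}

for name, model in backbones.items():
    model.eval()
    with torch.no_grad():
        f_train = model(x_train).numpy()  # forward() returns the CLS feature
        f_test = model(x_test).numpy()
    probe = LogisticRegression(max_iter=1000).fit(f_train, y_train)
    print(name, probe.score(f_test, y_test))
```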
Hi @rbareja25, I am experiencing similar issues, i.e. I am struggling to reproduce DINO results using DINOv2.
In our case we used exactly the same hyperparameters as in DINO. We also conducted an ablation study, switching off the KoLeo and iBOT losses, without any success.
@qasfb @patricklabatut Would you expect to get comparable results to DINO if you trained DINOv2 with ibot_loss_weight = 0, koleo_loss_weight = 0 and identical hyperparameters?
Hi @vladchimescu @rbareja25, I faced the same issue with segmentation based on DINO vs DINOv2 features. Another look at the follow-up paper ("Vision Transformers Need Registers") reveals that the authors are aware of such a discrepancy:
From the paper:
When used to extract features, it delivers disappointing performance, only on par with supervised alternative backbones in this scenario. This suggests that DINOv2 behaves differently than DINO. The investigation described in this work notably exposes the presence of artefacts in the feature maps of DINOv2 that were not present in the first version of this model
[…]
We note that we have not been able to fully determine which aspects of the training led to the appearance of artifacts in DINOv2 but not in DINO, but Fig. 4 suggests that scaling the model size beyond ViT-L, and longer training length may be possible causes
Hope it helps to clear up some confusion.
P.S. I haven't tried DINOv2 with registers yet; if anything changes, I'll report back here :)
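In case it is useful: the register variants are available on torch.hub with pretrained LVD-142M weights (so not in-domain), and trying them only means swapping the hub entry point, e.g.:

```python
import torch

# DINOv2 ViT-L/14 with register tokens, pretrained weights from torch.hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14_reg")
model.eval()

with torch.no_grad():
    cls_feat = model(torch.randn(1, 3, 224, 224))  # CLS feature, shape (1, 1024)
    dense = model.forward_features(torch.randn(1, 3, 224, 224))["x_norm_patchtokens"]
print(cls_feat.shape, dense.shape)  # dense: (1, 256, 1024) for ViT-L at 224x224
```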
Hi @alexaatm, thank you for pointing this out.
@qasfb @patricklabatut I believe this is an implementation issue. We set KoLeo and iBOT loss weights to zero and we could not reproduce the good performance of DINOv1, despite using the same hyperparameters.
We haven't tried ViTs with registers as we're training DINOv2 + ViT-S. The paper that you cite @alexaatm suggests that registers are needed for large ViTs trained on massive datasets.
Hey! Can you say a bit more about your setup: how many GPUs, what batch size, and what config in general?
@qasfb Sure! We are using NVIDIA A100 GPUs (80 GB GPU memory). For the ViT-S backbone, we only need 1 GPU, but we also tried FSDP with 2 GPUs.
We used the same batch size (256) as for DINO (Caron et al., 2021). We found that for our custom dataset, a batch size <= 256 was beneficial.
To reproduce our DINO baseline, we switched off iBOT and KoLeo losses and used the same student and teacher hyperparameters that we had previously used for DINO. I'm pasting the config file below:
MODEL:
  WEIGHTS: ''
compute_precision:
  grad_scaler: true
  teacher:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
  student:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
dino:
  loss_weight: 1.0
  head_n_prototypes: 10000
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
  koleo_loss_weight: 0
ibot:
  loss_weight: 0
  mask_sample_probability: 0.5
  mask_ratio_min_max:
  - 0.1
  - 0.5
  separate_head: false
  head_n_prototypes: 10000
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
train:
  batch_size_per_gpu: 256
  output_dir: .
  saveckp_freq: 10
  seed: 0
  num_workers: 12
  OFFICIAL_EPOCH_LENGTH: 2204
  cache_dataset: true
  centering: "centering" # or "sinkhorn_knopp"
student:
  arch: vit_small
  patch_size: 16
  drop_path_rate: 0
  layerscale: 1.0e-05
  drop_path_uniform: true
  pretrained_weights: ''
  ffn_layer: "mlp"
  block_chunks: 0
  qkv_bias: true
  proj_bias: true
  ffn_bias: true
teacher:
  momentum_teacher: 0.9995
  final_momentum_teacher: 1
  warmup_teacher_temp: 0.01
  teacher_temp: 0.04 # TODO this should be set to 0.04 (!)
  warmup_teacher_temp_epochs: 30
optim:
  epochs: 200
  weight_decay: 0.04
  weight_decay_end: 0.4
  base_lr: 0.001 # learning rate for a batch size of 256
  lr: 0. # will be set after applying scaling rule
  warmup_epochs: 20
  min_lr: 1.0e-06
  clip_grad: 3.0
  freeze_last_layer_epochs: 3
  scaling_rule: multiple_of_256
  patch_embed_lr_mult: 0.2
  layerwise_decay: 0.9
  adamw_beta1: 0.9
  adamw_beta2: 0.999
crops:
  global_crops_scale:
  - 0.32
  - 1.0
  local_crops_number: 8
  local_crops_scale:
  - 0.05
  - 0.32
  global_crops_size: 224
  local_crops_size: 96
evaluation:
  eval_period_iterations: 12500
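A side note on the optim block above: lr is left at 0 and is filled in by the scaling rule. The base_lr comment refers to the DINO (v1) convention of scaling the learning rate linearly with the global batch size over 256; a minimal sketch of that rule, assuming a single GPU at batch size 256:

```python
# DINO-style linear lr scaling: base_lr is defined for a global batch size of 256.
base_lr = 0.001
global_batch_size = 256 * 1     # batch_size_per_gpu x (assumed) number of GPUs
lr = base_lr * global_batch_size / 256
print(lr)                       # 0.001 for a single GPU at batch size 256
```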
How did you pick the hyperparameters in this config? I see layerwise_decay 0.9 and momentum_teacher 0.9995, and I'm pretty sure these were not in DINO. Similarly, layer scale is not used in DINO either.
@qasfb The teacher momentum is from the original DINO and we used the value that produced the best results.
Indeed, layer scale was not part of the original DINO and we also ran an experiment without layer scale, which didn't make a difference.
I am not sure what layerwise_decay does; I left it as it was in one of the config files that I found in this repo. What would be the equivalent value of layerwise_decay for DINO (v1)?
Do you have any other pointers regarding what is different compared to vanilla DINO (after switching off iBOT and KoLeo)?
To disable it: layerwise_decay = 1.0. Otherwise I don't know; I have not tried to reproduce DINO with this codebase, so I can't say for sure whether it would work well.
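To illustrate, a rough sketch of how a layer-wise lr decay multiplier is typically computed (the exact formula in this codebase may differ slightly); with decay = 1.0 every multiplier is 1.0, which is why that value disables it:

```python
def layerwise_lr_multipliers(num_blocks: int, decay: float) -> list[float]:
    # Block i (1-indexed) gets decay ** (num_blocks + 1 - i): the earliest block
    # is scaled down the most, the last block by `decay`, and the head by 1.0.
    return [decay ** (num_blocks + 1 - i) for i in range(1, num_blocks + 1)]

# Example: a 12-block ViT-S with layerwise_decay = 0.9 vs 1.0
print([round(m, 3) for m in layerwise_lr_multipliers(12, 0.9)])  # 0.282 ... 0.9
print(layerwise_lr_multipliers(12, 1.0))                         # all 1.0
```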
My config file is here:
MODEL:
  WEIGHTS: ''
compute_precision:
  grad_scaler: true
  teacher:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
  student:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
dino:
  loss_weight: 1.0
  head_n_prototypes: 65536
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
  koleo_loss_weight: 0.1
ibot:
  loss_weight: 1.0
  mask_sample_probability: 0.5
  mask_ratio_min_max:
  - 0.1
  - 0.5
  separate_head: false
  head_n_prototypes: 65536
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
train:
  batch_size_per_gpu: 8
  dataset_path: ImageNet:split=TRAIN
  output_dir: /home/rbareja/dinov2/dinov2path_32tumors_1000patches_300ep
  saveckp_freq: 20
  seed: 0
  num_workers: 10
  OFFICIAL_EPOCH_LENGTH: 1250
  cache_dataset: true
  centering: centering
student:
  arch: vit_large
  patch_size: 16
  drop_path_rate: 0.3
  layerscale: 1.0e-05
  drop_path_uniform: true
  pretrained_weights: ''
  ffn_layer: mlp
  block_chunks: 4
  qkv_bias: true
  proj_bias: true
  ffn_bias: true
teacher:
  momentum_teacher: 0.992
  final_momentum_teacher: 1
  warmup_teacher_temp: 0.04
  teacher_temp: 0.07
  warmup_teacher_temp_epochs: 30
optim:
  epochs: 300
  weight_decay: 0.04
  weight_decay_end: 0.4
  base_lr: 0.004
  lr: 0.0007071067811865476
  warmup_epochs: 10
  min_lr: 1.0e-06
  clip_grad: 3.0
  freeze_last_layer_epochs: 1
  scaling_rule: sqrt_wrt_1024
  patch_embed_lr_mult: 0.2
  layerwise_decay: 0.9
  adamw_beta1: 0.9
  adamw_beta2: 0.999
crops:
  global_crops_scale:
  - 0.32
  - 1.0
  local_crops_number: 8
  local_crops_scale:
  - 0.05
  - 0.32
  global_crops_size: 224
  local_crops_size: 96
evaluation:
  eval_period_iterations: 12500
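A note on the lr value above: assuming the sqrt_wrt_1024 rule sets lr = base_lr * sqrt(global_batch_size / 1024), the stored value is consistent with a global batch of 32 (e.g. 8 per GPU on 4 GPUs; the GPU count is an assumption here):

```python
import math

base_lr = 0.004
global_batch_size = 8 * 4   # batch_size_per_gpu x assumed number of GPUs
lr = base_lr * math.sqrt(global_batch_size / 1024)
print(lr)                   # ~0.000707, matching the lr set in the config above
```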
If we turn off the changes you mentioned, what would we expect: 1) DINOv2 performance improves, because we have turned off the hyperparameters that might be hurting it, or 2) it basically becomes the same as DINO?
Hi @vladchimescu, were you able to improve DINOv2 performance? Could you share any suggestions?
> How did you pick the hyperparameters in this config? I see layerwise_decay 0.9 and momentum_teacher 0.9995, and I'm pretty sure these were not in DINO. Similarly, layer scale is not used in DINO either.
I see that momentum_teacher is present in DINO.
Hey @rbareja25 @vladchimescu @qasfb,
What is your ETA when using 1 GPU (A100 80 GB), and how many images are you using?
Also, @vladchimescu, you set OFFICIAL_EPOCH_LENGTH to 2204 and the batch size to 256; does this mean you have 2204 x 256 training samples?
I am trying to train on a custom dataset as well, and I am getting an ETA of 142 days, 10:53:15 (using 1 A100 80 GB GPU), which seems very high for 80K images. Here is my config:
MODEL:
  WEIGHTS: ''
compute_precision:
  grad_scaler: true
  teacher:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
  student:
    backbone:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp16
        buffer_dtype: fp32
    dino_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
    ibot_head:
      sharding_strategy: SHARD_GRAD_OP
      mixed_precision:
        param_dtype: fp16
        reduce_dtype: fp32
        buffer_dtype: fp32
dino:
  loss_weight: 1.0
  head_n_prototypes: 65536
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
  koleo_loss_weight: 0.1
ibot:
  loss_weight: 1.0
  mask_sample_probability: 0.5
  mask_ratio_min_max:
  - 0.1
  - 0.5
  separate_head: false
  head_n_prototypes: 65536
  head_bottleneck_dim: 256
  head_nlayers: 3
  head_hidden_dim: 2048
train:
  batch_size_per_gpu: 64
  dataset_path: ChestX_ray14
  output_dir: /scratch/rnolas66/checkpoints/dinov2/vit-base-random
  saveckp_freq: 20
  seed: 0
  num_workers: 8
  OFFICIAL_EPOCH_LENGTH: 1250
  cache_dataset: true
  centering: sinkhorn_knopp
student:
  arch: vit_base
  patch_size: 14
  drop_path_rate: 0.4
  layerscale: 1.0e-05
  drop_path_uniform: true
  pretrained_weights: ''
  ffn_layer: swiglufused
  block_chunks: 4
  qkv_bias: true
  proj_bias: true
  ffn_bias: true
  num_register_tokens: 0
  interpolate_antialias: false
  interpolate_offset: 0.1
teacher:
  momentum_teacher: 0.994
  final_momentum_teacher: 1
  warmup_teacher_temp: 0.04
  teacher_temp: 0.07
  warmup_teacher_temp_epochs: 30
optim:
  epochs: 500
  weight_decay: 0.04
  weight_decay_end: 0.2
  base_lr: 0.0002
  lr: 2.1650635094610966e-05
  warmup_epochs: 80
  min_lr: 1.0e-06
  clip_grad: 3.0
  freeze_last_layer_epochs: 1
  scaling_rule: sqrt_wrt_1024
  patch_embed_lr_mult: 0.2
  layerwise_decay: 1.0
  adamw_beta1: 0.9
  adamw_beta2: 0.999
crops:
  global_crops_scale:
  - 0.32
  - 1.0
  local_crops_number: 8
  local_crops_scale:
  - 0.05
  - 0.32
  global_crops_size: 224
  local_crops_size: 98
evaluation:
  eval_period_iterations: 12500
Do you know what I could be doing wrong here?
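On my epoch-length question above: as far as I understand, OFFICIAL_EPOCH_LENGTH times the global batch size is simply the number of samples consumed per "epoch" (so 2204 x 256 in the earlier config would be about 564k samples per pass). For my config the numbers work out as follows; the per-iteration time is just what the reported ETA implies:

```python
# Quick arithmetic on the config above (ViT-B/14, 80K-image dataset, 1 GPU).
official_epoch_length = 1250
batch_size_per_gpu = 64
num_gpus = 1
epochs = 500

samples_per_epoch = official_epoch_length * batch_size_per_gpu * num_gpus
total_iterations = epochs * official_epoch_length
print(samples_per_epoch)    # 80000 -> one "epoch" is one full pass over 80K images
print(total_iterations)     # 625000 optimizer steps in total

# A reported ETA of ~142 days therefore implies roughly:
print(142 * 24 * 3600 / total_iterations)   # ~19.6 seconds per iteration
```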
Any clue about what happened?