How to efficiently load the model after training

Open quannguyenminh103 opened this issue 1 month ago • 3 comments

I am trying to retrain DINOv3 from the beginning (all 3 stages) on my custom dataset. I was able to train the first stage (8 H100 GPUs on one node, the maximum resources I have). However,

  1. I ran into an OOM error: slurmstepd: error: Detected 1 oom_kill event in StepId=58302118.0. Some of the step tasks have been OOM Killed. srun: error: g-44-01: task 7: Out Of Memory. It seems to require much more CPU memory. I believe MEM is set to 0 by default, which requests all the memory available. I already tried reducing the number of workers to 1 and the batch size to 4, but it still runs out of memory. I am afraid that lowering the batch size further may degrade performance. Is there any useful trick to resolve this issue?
  2. I tried to load the model from the first stage, and I have the sharded checkpoints (__0_0.distcp, etc.). a) Can we load directly from these .distcp files? If so, can you show me how? b) I tried converting them to .pth with dcp_to_torch, then loading the result with the following code:
import sys

import torch
import torch.distributed as dist
from torch.distributed.checkpoint import load
from torch.distributed.checkpoint import load_state_dict
from PIL import Image
from torchvision import transforms

# DINO_github is a placeholder for the path to the local dinov3 clone.
sys.path.append(DINO_github)
from dinov3.models.vision_transformer import vit_large, vit_base, vit_small, vit_7b

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Instantiate the full 7B model and move it to the GPU.
model = vit_7b(patch_size=16).to(device)
state_dict = model.state_dict()
# Read the converted checkpoint into CPU memory.
ckpt = torch.load("path/to/model.pth", map_location="cpu")

This takes a very long time to run, eventually fails with the error "UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.", and does not load the model properly. If I change to map_location='cuda', it exceeds 80 GB and causes an OOM. How should I properly load the model after training? Thank you.

quannguyenminh103 • Nov 04 '25 05:11

Just add an evaluation section to the config file, something like this:

evaluation:
  eval_period_iterations: 1

Then run for one more step and you'll find an eval directory containing teacher_checkpoint.pth.

blackpearl006 • Nov 19 '25 17:11

I cannot load the teacher_checkpoint.pth that I trained using your code; it reports mismatched keys. Did you run into this problem?

vicdxxx • Dec 01 '25 22:12
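
When a checkpoint reports mismatched keys, it can help to see exactly which names fail to line up before changing any config. This is a minimal sketch that uses the report returned by load_state_dict(strict=False). The "teacher" wrapper key and the "backbone." prefix are assumed layouts used only for illustration and are not confirmed for this checkpoint. strip_prefix is a hypothetical helper.

```python
import torch
import torch.nn as nn


def build_model() -> nn.Module:
    # Hypothetical stand-in for the real backbone.
    return nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))


def strip_prefix(state_dict: dict, prefix: str) -> dict:
    # Keep only keys under `prefix` and drop the prefix from their names.
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}


model = build_model()

# Simulate a checkpoint whose weights sit under a wrapper key and a name prefix
# (both are assumptions for this example).
ckpt = {"teacher": {"backbone." + k: v.clone() for k, v in model.state_dict().items()}}

# strict=False does not raise; it returns the names that did not match.
raw = ckpt["teacher"]
report = model.load_state_dict(raw, strict=False)
print("missing:", report.missing_keys)
print("unexpected:", report.unexpected_keys)

# After renaming, both lists should be empty.
fixed = strip_prefix(raw, "backbone.")
report_fixed = model.load_state_dict(fixed, strict=False)
```

Comparing the two printed lists usually shows whether the difference is a wrapper key, a constant prefix, or genuinely different parameters.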

In the config file, use pretrained_weights instead of resume_from_teacher_chkpt.

blackpearl006 • Dec 24 '25 09:12