Training a smaller VGGT model from scratch

Hi! Thank you for your great work and for being so responsive in the GitHub issues.

I am writing a paper in which I want to train a smaller version of VGGT for comparison, because my compute and storage budget is limited.

I have collected the datasets: CO3Dv2, MegaDepth, BlendedMVS, VKitti, Hypersim, WildRGBD, ScanNet, PointOdyssey and MVSSynth.

I am training a model (32 GPUs) with parameters based on your provided finetuning config (see below).

The model is smaller (embed_dim: 768, depth: 6, num_heads: 12) and the encoder is half frozen. The image resolution is lower and I train for only 100k steps.
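
As a rough sanity check on the training budget, the values from the config below (`limit_train_batches: 1000`, `max_epochs: 100`, `max_img_per_gpu: 48`) can be combined with the 32 GPUs. This is only a sketch: the function name is hypothetical, and it assumes that `limit_train_batches` caps the number of optimizer steps per epoch and that `max_img_per_gpu` is an upper bound on images per GPU per step.

```python
def training_budget(num_gpus, max_img_per_gpu, batches_per_epoch, max_epochs):
    """Back-of-the-envelope training budget (hypothetical helper).

    Assumes one optimizer step per batch (accum_steps == 1) and treats
    max_img_per_gpu as an upper bound, so the image counts are upper bounds.
    """
    total_steps = batches_per_epoch * max_epochs
    max_images_per_step = num_gpus * max_img_per_gpu
    return {
        "total_steps": total_steps,
        "max_images_per_step": max_images_per_step,
        "max_images_total": total_steps * max_images_per_step,
    }


# Values taken from this post and the config below.
budget = training_budget(
    num_gpus=32, max_img_per_gpu=48, batches_per_epoch=1000, max_epochs=100
)
print(budget)
```

Under these assumptions the run is 100,000 steps with at most 1,536 images per step, which matches the 100k steps mentioned above.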

Below are some loss curves:

Problem: I only reach about 55 AUC@30 on CO3Dv2, whereas the original VGGT reaches about 90. Do you have any thoughts on what the leading cause might be? Sorry for bombarding you :)
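
For reference, this is how I understand AUC@30. It is a minimal sketch of the commonly used definition, not necessarily the exact evaluation code: for each image pair take the larger of the rotation and translation angular errors (in degrees), measure the fraction of pairs below each integer threshold from 1 to 30, and average those fractions. The function name and the use of a strict `<` comparison are my own assumptions.

```python
def pose_auc(rot_err_deg, trans_err_deg, max_threshold=30):
    """Approximate pose AUC@max_threshold as a fraction in [0, 1].

    rot_err_deg / trans_err_deg: per-pair angular errors in degrees.
    """
    if len(rot_err_deg) != len(trans_err_deg) or len(rot_err_deg) == 0:
        raise ValueError("need two non-empty error lists of equal length")

    # A pair counts as correct only if both errors are below the threshold.
    worst = [max(r, t) for r, t in zip(rot_err_deg, trans_err_deg)]

    accuracies = []
    for threshold in range(1, max_threshold + 1):
        correct = sum(1 for e in worst if e < threshold)
        accuracies.append(correct / len(worst))

    # Averaging accuracy over thresholds 1..max_threshold approximates the
    # normalized area under the accuracy-vs-threshold curve.
    return sum(accuracies) / len(accuracies)


# Multiply by 100 to compare with numbers reported as "55" or "90".
print(100 * pose_auc([0.5, 45.0], [0.5, 45.0]))
```

Under this definition, a score of 55 means that, averaged over thresholds up to 30 degrees, roughly half of the pairs have both errors below the threshold.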

Much appreciated,

/ David

defaults:
  - default_dataset.yaml

exp_name: exp001
log_wandb: true
img_size: 336 # 400 # 480 # 512
num_workers: 8
seed_value: 42
accum_steps: 1    # We did not use gradient accumulation in our training; if you run out of memory (OOM), you can try it.
patch_size: 16
val_epoch_freq: 10
max_img_per_gpu: 48

limit_train_batches: 1000
limit_val_batches: 100

data:
  # The code for data still looks too complicated. I should refactor this again (do I have time?...)
  train:
    _target_: data.dynamic_dataloader.DynamicTorchDataset
    num_workers: ${num_workers}
    max_img_per_gpu: ${max_img_per_gpu}
    common_config:
      img_size: ${img_size}
      patch_size: ${patch_size}
      debug: False
      repeat_batch: False
    dataset:
      _target_: data.composed_dataset.ComposedDataset
      dataset_configs:
        - _target_: data.datasets.megadepth.MegadepthDataset
          split: train
          MEGADEPTH_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/data/megadepth
          MEGADEPTH_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000  
        - _target_: data.datasets.scannet.ScanNetDataset
          split: train
          SCANNET_DIR: /mimer/NOBACKUP/groups/3d-dl/scannet/scans/scans_train
          SCANNET_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.vkitti.VKittiDataset
          split: train
          VKitti_DIR: /mimer/NOBACKUP/groups/3d-dl/vkitti
          len_train: 100000
          expand_ratio: 8 
        - _target_: data.datasets.mvssynth.MVSSynthDataset
          split: train
          MVSSYNTH_DIR: /mimer/NOBACKUP/groups/3d-dl/MVS-Synth/GTAV_540
          MVSSYNTH_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.blendedmvs.BlendedMVSDataset
          split: train
          BLENDEDMVS_DIR: /mimer/NOBACKUP/groups/3d-dl/blendedmvs_full
          BLENDEDMVS_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.pointodyssey.PointOdysseyDataset
          split: train
          POINTODYSSEY_DIR: /mimer/NOBACKUP/groups/3d-dl/pointodyssey
          POINTODYSSEY_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.hypersim.HypersimDataset
          split: train
          HYPERSIM_DIR: /mimer/NOBACKUP/groups/3d-dl/ml-hypersim/contrib/99991/downloads
          HYPERSIM_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.wildrgbd.WildrgbdDataset
          split: train
          WILDRGBD_DIR: /mimer/NOBACKUP/groups/3d-dl/wildrgbd
          WILDRGBD_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000

        - _target_: data.datasets.co3dv2.Co3dDataset
          split: train
          CO3D_DIR: /mimer/NOBACKUP/groups/3d-dl/co3dv2
          CO3D_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations/co3d
          len_train: 100000
  val:
    _target_: data.dynamic_dataloader.DynamicTorchDataset
    num_workers: ${num_workers}
    max_img_per_gpu: ${max_img_per_gpu}
    common_config:
      img_size: ${img_size}
      patch_size: ${patch_size}
      debug: False
    dataset:
      _target_: data.composed_dataset.ComposedDataset
      dataset_configs:
        - _target_: data.datasets.megadepth.MegadepthDataset
          split: test
          MEGADEPTH_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/data/megadepth
          MEGADEPTH_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations

logging:
  log_dir: logs
  log_visuals: False
  log_freq: 5
  log_level_primary: DEBUG
  log_level_secondary: WARNING
  all_ranks: False
  tensorboard_writer:
    _target_: train_utils.tb_writer.TensorBoardLogger
    path: ${logging.log_dir}/tensorboard
  scalar_keys_to_log:
    train:
      keys_to_log:
        - loss_objective
        - loss_camera
        - loss_T
        - loss_R
        - loss_FL
        - loss_conf_depth
        - loss_reg_depth
        - loss_grad_depth
    val:
      keys_to_log:
        - loss_objective
        - loss_camera
        - loss_T
        - loss_R
        - loss_FL
        - loss_conf_depth
        - loss_reg_depth
        - loss_grad_depth



checkpoint:
  save_dir: logs/${exp_name}/ckpts
  save_freq: 20
  resume_checkpoint_path: # /YOUR/PATH/TO/CKPT
  strict: False


loss:
  _target_: loss.MultitaskLoss
  camera: 
    weight: 5.0
    loss_type: "l1" # The paper uses smooth l1 loss, but we found l1 loss is more stable than smooth l1 and l2 loss.  
  depth:
    weight: 1.0
    gradient_loss_fn: "grad" 
    valid_range: 0.98
  # point: null
  # If you want to enable point, use the following config
  point: 
    weight: 1.0
    gradient_loss_fn: "normal" 
    valid_range: 0.98
  track: null   

optim:
  param_group_modifiers: False

  optimizer:
    _target_: torch.optim.AdamW
    lr: 1e-4 # 5e-5
    weight_decay: 0.05

  frozen_module_names:
      # - "*aggregator*"  # example, freeze the aggregator

  amp:
    enabled: True
    amp_dtype: bfloat16
  gradient_clip:
    _target_: train_utils.gradient_clip.GradientClipper
    configs:
      - module_name: ["aggregator"]
        max_norm: 1.0   # feel free to reduce this if you see instabilities
        norm_type: 2
      - module_name: ["depth"]
        max_norm: 1.0   # feel free to reduce this if you see instabilities
        norm_type: 2
      - module_name: ["camera"]
        max_norm: 1.0   # feel free to reduce this if you see instabilities
        norm_type: 2
      - module_name: ["point"]
        max_norm: 1.0   # feel free to reduce this if you see instabilities
        norm_type: 2
  options:
    lr:
      - scheduler:
          _target_: fvcore.common.param_scheduler.CompositeParamScheduler
          schedulers:
            - _target_: fvcore.common.param_scheduler.LinearParamScheduler
              start_value: 1e-8
              end_value: ${optim.optimizer.lr}
            - _target_: fvcore.common.param_scheduler.CosineParamScheduler
              start_value: ${optim.optimizer.lr}
              end_value: 1e-8
          lengths: [0.05, 0.95]
          interval_scaling: ['rescaled', 'rescaled']
    weight_decay:
      - scheduler:
          _target_: fvcore.common.param_scheduler.ConstantParamScheduler
          value: 0.05

max_epochs: 100

# Base: 
# embed_dim=768
# depth=12
# num_heads=12

# Large: 
# embed_dim=1024
# depth=24
# num_heads=16

model:
  _target_: vggt.models.vggt_small.VGGT
  img_size: ${img_size}
  embed_dim: 768
  depth: 6
  num_heads: 12

  enable_camera: True
  enable_depth: True
  enable_point: True
  enable_track: False
  patch_size: ${patch_size}
  patch_embed: dinov3 # crocov2 # mum # dinov3

distributed:
  # check https://docs.pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html for options
  backend: nccl
  comms_dtype: None
  find_unused_parameters: False
  timeout_mins: 30
  gradient_as_bucket_view: True  # Less memory used
  bucket_cap_mb: 25
  broadcast_buffers: True

cuda:
    cudnn_deterministic: False
    cudnn_benchmark: False
    allow_tf32: True
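
The `optim.options.lr` entry in the config above combines a linear warmup over the first 5% of training with a cosine decay over the remaining 95%. The sketch below reproduces that shape in plain Python without fvcore. The function and parameter names are hypothetical, and it assumes the schedule is evaluated on training progress in [0, 1] over the whole run.

```python
import math


def lr_at(progress, peak_lr=1e-4, floor_lr=1e-8, warmup_frac=0.05):
    """Learning rate at a given training progress in [0, 1].

    Mirrors the config: linear warmup from floor_lr to peak_lr over
    warmup_frac of training, then cosine decay back to floor_lr.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")

    if progress < warmup_frac:
        # Rescale progress to [0, 1) within the warmup interval.
        local = progress / warmup_frac
        return floor_lr + (peak_lr - floor_lr) * local

    # Rescale progress to [0, 1] within the cosine interval.
    local = (progress - warmup_frac) / (1.0 - warmup_frac)
    return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * local))


# With 100k steps, warmup would end around step 5,000 under this assumption.
for step in (0, 5_000, 52_500, 100_000):
    print(step, lr_at(step / 100_000))
```

With these settings the learning rate peaks at 1e-4 after warmup, is roughly half of that at the midpoint of the cosine phase, and returns to about 1e-8 at the end.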

Oct 15 '25 11:10 davnords

Hi, I am also training VGGT with the point head, but I find it difficult to get it to converge, even though I have checked that the world points are correct. Could you share the loss curve of the point head?

Oct 22 '25 03:10 Jou719

Hi! I think I forgot to log the point head loss. Here are the losses I did log:

Oct 22 '25 19:10 davnords