Training a smaller model from scratch
Hi! Thank you for your great work and for being so responsive in the GitHub issues.
I am writing a paper where I want to train a smaller version of VGGT (due to my limited compute and storage budget) for some comparisons.
I have collected the datasets: CO3Dv2, MegaDepth, BlendedMVS, VKitti, Hypersim, WildRGBD, ScanNet, PointOdyssey and MVSSynth.
I am training the model on 32 GPUs with parameters based on the finetuning config you provided (see below).
The model is smaller (embed_dim: 768, depth: 6, num_heads: 12) and the encoder is half frozen. The image resolution is lower and I train for only 100k steps.
Below are some loss curves:
Problem: I only reach about 55 AUC@30 on CO3Dv2, whereas the actual VGGT reaches about 90. Do you have any thoughts on what the leading cause might be? Sorry for bombarding you :)
Much appreciated,
/ David
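For context on the AUC@30 number quoted above, here is a minimal sketch of how a pose AUC of this kind is commonly computed: take the larger of the rotation and translation angular errors per image pair, build a cumulative histogram in 1-degree bins up to the threshold, and average it. The function name `pose_auc` and its arguments are hypothetical and not taken from the VGGT code base, whose evaluation script may differ in details.

```python
def pose_auc(rot_err_deg, trans_err_deg, max_threshold=30):
    """Area under the accuracy-vs-threshold curve, in [0, 1].

    Hypothetical helper: inputs are per-pair angular errors in degrees.
    """
    # A pair counts as correct at threshold t only if BOTH errors are <= t,
    # so the per-pair error is the maximum of the two.
    errors = [max(r, t) for r, t in zip(rot_err_deg, trans_err_deg)]
    if not errors:
        raise ValueError("need at least one image pair")

    # Histogram with 1-degree bins [0,1), [1,2), ..., [T-1, T].
    counts = [0] * max_threshold
    for e in errors:
        if 0 <= e < max_threshold:
            counts[int(e)] += 1
        elif e == max_threshold:
            counts[-1] += 1  # right edge belongs to the last bin

    # Mean of the normalized cumulative histogram approximates the AUC.
    cumulative, area = 0.0, 0.0
    for c in counts:
        cumulative += c / len(errors)
        area += cumulative
    return area / max_threshold


if __name__ == "__main__":
    # One perfect pair and one badly wrong pair.
    print(pose_auc([0.5, 100.0], [0.2, 80.0]))
```

Under this definition, a score of 55 (reported as a percentage) can come either from many pairs with moderate errors or from a mix of accurate pairs and complete failures, so looking at the error histogram itself may help narrow down the cause.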
defaults:
- default_dataset.yaml
exp_name: exp001
log_wandb: true
img_size: 336 # 400 # 480 # 512
num_workers: 8
seed_value: 42
accum_steps: 1 # We did not use gradient accumulation in our training, while if you suffer from OOM, you can try to use it.
patch_size: 16
val_epoch_freq: 10
max_img_per_gpu: 48
limit_train_batches: 1000
limit_val_batches: 100
data:
  # The code for data still looks too complicated. I should refactor this again (do I have time?...)
  train:
    _target_: data.dynamic_dataloader.DynamicTorchDataset
    num_workers: ${num_workers}
    max_img_per_gpu: ${max_img_per_gpu}
    common_config:
      img_size: ${img_size}
      patch_size: ${patch_size}
      debug: False
      repeat_batch: False
    dataset:
      _target_: data.composed_dataset.ComposedDataset
      dataset_configs:
        - _target_: data.datasets.megadepth.MegadepthDataset
          split: train
          MEGADEPTH_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/data/megadepth
          MEGADEPTH_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.scannet.ScanNetDataset
          split: train
          SCANNET_DIR: /mimer/NOBACKUP/groups/3d-dl/scannet/scans/scans_train
          SCANNET_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.vkitti.VKittiDataset
          split: train
          VKitti_DIR: /mimer/NOBACKUP/groups/3d-dl/vkitti
          len_train: 100000
          expand_ratio: 8
        - _target_: data.datasets.mvssynth.MVSSynthDataset
          split: train
          MVSSYNTH_DIR: /mimer/NOBACKUP/groups/3d-dl/MVS-Synth/GTAV_540
          MVSSYNTH_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.blendedmvs.BlendedMVSDataset
          split: train
          BLENDEDMVS_DIR: /mimer/NOBACKUP/groups/3d-dl/blendedmvs_full
          BLENDEDMVS_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.pointodyssey.PointOdysseyDataset
          split: train
          POINTODYSSEY_DIR: /mimer/NOBACKUP/groups/3d-dl/pointodyssey
          POINTODYSSEY_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.hypersim.HypersimDataset
          split: train
          HYPERSIM_DIR: /mimer/NOBACKUP/groups/3d-dl/ml-hypersim/contrib/99991/downloads
          HYPERSIM_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.wildrgbd.WildrgbdDataset
          split: train
          WILDRGBD_DIR: /mimer/NOBACKUP/groups/3d-dl/wildrgbd
          WILDRGBD_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
          len_train: 100000
        - _target_: data.datasets.co3dv2.Co3dDataset
          split: train
          CO3D_DIR: /mimer/NOBACKUP/groups/3d-dl/co3dv2
          CO3D_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations/co3d
          len_train: 100000
  val:
    _target_: data.dynamic_dataloader.DynamicTorchDataset
    num_workers: ${num_workers}
    max_img_per_gpu: ${max_img_per_gpu}
    common_config:
      img_size: ${img_size}
      patch_size: ${patch_size}
      debug: False
    dataset:
      _target_: data.composed_dataset.ComposedDataset
      dataset_configs:
        - _target_: data.datasets.megadepth.MegadepthDataset
          split: test
          MEGADEPTH_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/data/megadepth
          MEGADEPTH_ANNOTATION_DIR: /mimer/NOBACKUP/groups/snic2022-6-266/davnords/vggt/annotations
logging:
  log_dir: logs
  log_visuals: False
  log_freq: 5
  log_level_primary: DEBUG
  log_level_secondary: WARNING
  all_ranks: False
  tensorboard_writer:
    _target_: train_utils.tb_writer.TensorBoardLogger
    path: ${logging.log_dir}/tensorboard
  scalar_keys_to_log:
    train:
      keys_to_log:
        - loss_objective
        - loss_camera
        - loss_T
        - loss_R
        - loss_FL
        - loss_conf_depth
        - loss_reg_depth
        - loss_grad_depth
    val:
      keys_to_log:
        - loss_objective
        - loss_camera
        - loss_T
        - loss_R
        - loss_FL
        - loss_conf_depth
        - loss_reg_depth
        - loss_grad_depth
checkpoint:
  save_dir: logs/${exp_name}/ckpts
  save_freq: 20
  resume_checkpoint_path: # /YOUR/PATH/TO/CKPT
  strict: False
loss:
  _target_: loss.MultitaskLoss
  camera:
    weight: 5.0
    loss_type: "l1" # The paper uses smooth l1 loss, but we found l1 loss is more stable than smooth l1 and l2 loss.
  depth:
    weight: 1.0
    gradient_loss_fn: "grad"
    valid_range: 0.98
  # point: null
  # If you want to enable point, use the following config
  point:
    weight: 1.0
    gradient_loss_fn: "normal"
    valid_range: 0.98
  track: null
optim:
  param_group_modifiers: False
  optimizer:
    _target_: torch.optim.AdamW
    lr: 1e-4 # 5e-5
    weight_decay: 0.05
  frozen_module_names:
    # - "*aggregator*" # example, freeze the aggregator
  amp:
    enabled: True
    amp_dtype: bfloat16
  gradient_clip:
    _target_: train_utils.gradient_clip.GradientClipper
    configs:
      - module_name: ["aggregator"]
        max_norm: 1.0 # feel free to reduce this if you see instabilities
        norm_type: 2
      - module_name: ["depth"]
        max_norm: 1.0 # feel free to reduce this if you see instabilities
        norm_type: 2
      - module_name: ["camera"]
        max_norm: 1.0 # feel free to reduce this if you see instabilities
        norm_type: 2
      - module_name: ["point"]
        max_norm: 1.0 # feel free to reduce this if you see instabilities
        norm_type: 2
  options:
    lr:
      - scheduler:
          _target_: fvcore.common.param_scheduler.CompositeParamScheduler
          schedulers:
            - _target_: fvcore.common.param_scheduler.LinearParamScheduler
              start_value: 1e-8
              end_value: ${optim.optimizer.lr}
            - _target_: fvcore.common.param_scheduler.CosineParamScheduler
              start_value: ${optim.optimizer.lr}
              end_value: 1e-8
          lengths: [0.05, 0.95]
          interval_scaling: ['rescaled', 'rescaled']
    weight_decay:
      - scheduler:
          _target_: fvcore.common.param_scheduler.ConstantParamScheduler
          value: 0.05
max_epochs: 100
# Base:
# embed_dim=768
# depth=12
# num_heads=12
# Large:
# embed_dim=1024
# depth=24
# num_heads=16
model:
  _target_: vggt.models.vggt_small.VGGT
  img_size: ${img_size}
  embed_dim: 768
  depth: 6
  num_heads: 12
  enable_camera: True
  enable_depth: True
  enable_point: True
  enable_track: False
  patch_size: ${patch_size}
  patch_embed: dinov3 # crocov2 # mum # dinov3
distributed:
  # check https://docs.pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html for options
  backend: nccl
  comms_dtype: None
  find_unused_parameters: False
  timeout_mins: 30
  gradient_as_bucket_view: True # Less memory used
  bucket_cap_mb: 25
  broadcast_buffers: True
cuda:
  cudnn_deterministic: False
  cudnn_benchmark: False
  allow_tf32: True
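To make the training budget in the config above easier to sanity-check, here is a minimal sketch of the step count and the learning-rate curve it implies. It assumes that `limit_train_batches` is the number of optimizer steps per epoch (which matches the 100k steps mentioned in the post), that scheduler progress is `step / total_steps`, and that `max_img_per_gpu` is an upper bound on images per GPU per step. The helper `lr_at` is hypothetical and is written in plain Python rather than calling fvcore.

```python
import math

# Values copied from the config above.
LIMIT_TRAIN_BATCHES = 1000   # assumed: optimizer steps per epoch
MAX_EPOCHS = 100
NUM_GPUS = 32
MAX_IMG_PER_GPU = 48
PEAK_LR = 1e-4
FLOOR_LR = 1e-8
WARMUP_FRAC = 0.05           # lengths: [0.05, 0.95]

TOTAL_STEPS = LIMIT_TRAIN_BATCHES * MAX_EPOCHS
# Upper bound only: the dynamic dataloader may use fewer images per step.
MAX_IMAGES_PER_STEP = NUM_GPUS * MAX_IMG_PER_GPU


def lr_at(step, total_steps=TOTAL_STEPS):
    """Linear warmup followed by cosine decay, each over its own rescaled interval."""
    where = step / total_steps
    if where < WARMUP_FRAC:
        local = where / WARMUP_FRAC
        return FLOOR_LR + (PEAK_LR - FLOOR_LR) * local
    local = (where - WARMUP_FRAC) / (1.0 - WARMUP_FRAC)
    return FLOOR_LR + 0.5 * (PEAK_LR - FLOOR_LR) * (1.0 + math.cos(math.pi * local))


if __name__ == "__main__":
    print("total steps:", TOTAL_STEPS)
    print("max images per step:", MAX_IMAGES_PER_STEP)
    for s in (0, 2500, 5000, 50000, 100000):
        print(s, f"{lr_at(s):.3e}")
```

Under these assumptions the learning rate peaks at step 5,000 and has decayed to roughly half of its peak shortly after the midpoint of training, which is worth keeping in mind when comparing against a run with a different step budget.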
Hi, I am also training VGGT with the point head, but I find it difficult to get it to converge, even though I have checked that the world points are correct. Could you share the loss curve of the point head?
Hi! I think I forgot to log the point head loss. Here are the losses I logged: