
VGGT Finetuning Issue on VKITTI Datasets

Open · SaiPrasanthBL opened this issue 3 months ago · 2 comments

Hi, thank you for open sourcing this awesome work! We are trying to finetune VGGT (from the original checkpoint) on the VKITTI dataset on a single NVIDIA RTX 6000 Ada (48 GB), but the output visualization from demo_viser shows duplicated, layered point clouds (screenshots attached). We are finetuning with a frozen aggregator (see the sanity-check sketch after the config below). We've also attached our default.yaml file below.

[Screenshots: demo_viser visualization showing duplicated, layered point clouds]

Here's our default.yaml:

defaults:
  - _self_
  - default_dataset.yaml

exp_name: exp010_ft_og_ckpt_corrected
img_size: 518
num_workers: 8
seed_value: 42
accum_steps: 2  # We did not use gradient accumulation in our training, while if you suffer from OOM, you can try to use it.
patch_size: 14
val_epoch_freq: 1000000000
max_img_per_gpu: 48

limit_train_batches: 800
limit_val_batches: 100

data:
  train:
    _target_: data.dynamic_dataloader.DynamicTorchDataset
    num_workers: ${num_workers}
    max_img_per_gpu: ${max_img_per_gpu}
    common_config:
      img_size: ${img_size}
      patch_size: ${patch_size}
      debug: True
      repeat_batch: False
    dataset:
      _target_: data.composed_dataset.ComposedDataset
      dataset_configs:
        - _target_: data.datasets.vkitti.VKittiDataset
          split: train
          VKitti_DIR: /home/sbangal4/vggt_new/vggt/data/vkitti/vkitti
  val:
    _target_: data.dynamic_dataloader.DynamicTorchDataset
    num_workers: ${num_workers}
    max_img_per_gpu: ${max_img_per_gpu}
    common_config:
      img_size: ${img_size}
      patch_size: ${patch_size}
      debug: True
    dataset:
      _target_: data.composed_dataset.ComposedDataset
      dataset_configs:
        - _target_: data.datasets.vkitti.VKittiDataset
          split: train
          VKitti_DIR: /home/sbangal4/vggt_new/vggt/data/vkitti/vkitti

logging:
  log_dir: logs
  log_visuals: False
  log_freq: 1
  log_level_primary: DEBUG
  log_level_secondary: WARNING
  all_ranks: False
  tensorboard_writer:
    _target_: train_utils.tb_writer.TensorBoardLogger
    path: ${logging.log_dir}/tensorboard
  scalar_keys_to_log:
    train:
      keys_to_log:
        - loss_objective
        - loss_camera
        - loss_T
        - loss_R
        - loss_FL
        - loss_conf_depth
        - loss_reg_depth
        - loss_grad_depth
    val:
      keys_to_log:
        - loss_objective
        - loss_camera
        - loss_T
        - loss_R
        - loss_FL
        - loss_conf_depth
        - loss_reg_depth
        - loss_grad_depth

checkpoint:
  save_dir: logs/${exp_name}/ckpts
  save_freq: 5
  resume_checkpoint_path: /home/sbangal4/vggt_new/vggt/checkpoint/model.pt
  strict: False

loss:
  _target_: loss.MultitaskLoss
  camera:
    weight: 5.0
    loss_type: "l1"  # The paper uses smooth l1 loss, but we found l1 loss is more stable than smooth l1 and l2 loss.
  depth:
    weight: 1.0
    gradient_loss_fn: "grad"
    valid_range: 0.98
  # point: null
  # If you want to enable point, use the following config
  point:
    weight: 1.0
    gradient_loss_fn: "normal"
    valid_range: 0.98
  track: null

optim:
  param_group_modifiers: False
  optimizer:
    _target_: torch.optim.AdamW
    lr: 1e-6
    weight_decay: 0.05
  frozen_module_names:
    - ["*aggregator*"]
  amp:
    enabled: True
    amp_dtype: bfloat16
  gradient_clip:
    _target_: train_utils.gradient_clip.GradientClipper
    configs:
      - module_name: [""]
        params: [".*"]
        max_norm: 1.0
        norm_type: 2
  options:
    lr:
      - scheduler:
          _target_: fvcore.common.param_scheduler.CompositeParamScheduler
          schedulers:
            - _target_: fvcore.common.param_scheduler.LinearParamScheduler
              start_value: 1e-8
              end_value: 5e-5
            - _target_: fvcore.common.param_scheduler.CosineParamScheduler
              start_value: 5e-5
              end_value: 1e-8
          lengths: [0.05, 0.95]
          interval_scaling: ['rescaled', 'rescaled']
    weight_decay:
      - scheduler:
          _target_: fvcore.common.param_scheduler.ConstantParamScheduler
          value: 0.05

max_epochs: 20

model:
  _target_: vggt.models.vggt.VGGT
  enable_camera: True
  enable_depth: True
  enable_point: True
  enable_track: True

distributed:
  backend: nccl
  comms_dtype: None
  find_unused_parameters: True
  timeout_mins: 30
  gradient_as_bucket_view: True  # Less memory used
  bucket_cap_mb: 25
  broadcast_buffers: True

cuda:
  cudnn_deterministic: False
  cudnn_benchmark: False
  allow_tf32: True
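As a sanity check on the frozen aggregator, here is a minimal sketch of how one can verify which parameters are actually frozen before launching training. It assumes the training-branch VGGT constructor accepts the enable_* flags passed in the config above, and that model.pt stores a plain state dict (possibly nested under a "model" key); adjust those details to the actual code.

```python
import torch
from vggt.models.vggt import VGGT

# Build the model with the same head flags as in the config above.
# NOTE: the enable_* kwargs mirror the Hydra config; adjust if the constructor differs.
model = VGGT(enable_camera=True, enable_depth=True, enable_point=True, enable_track=True)

# Load the checkpoint referenced by resume_checkpoint_path (the dict layout is an assumption).
ckpt = torch.load("/home/sbangal4/vggt_new/vggt/checkpoint/model.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")

# Freeze the aggregator, mirroring frozen_module_names in the config.
for name, param in model.named_parameters():
    if "aggregator" in name:
        param.requires_grad = False

# Report how many parameters remain trainable vs. frozen.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"trainable: {trainable / 1e6:.1f}M params, frozen: {frozen / 1e6:.1f}M params")
```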

SaiPrasanthBL · Sep 07 '25 20:09

Hello, have you found a solution? I'm also curious how your training was conducted with

limit_train_batches: 800
limit_val_batches: 100

This setting is only for testing, which means not enough gradient steps are run.
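For a rough sense of scale, here is a back-of-the-envelope sketch of the optimizer-update budget implied by the posted config (it ignores whatever per-GPU batch packing DynamicTorchDataset does, and assumes accum_steps divides the update count as usual):

```python
# Values taken from the default.yaml posted above
limit_train_batches = 800  # batches per epoch
max_epochs = 20
accum_steps = 2            # gradient accumulation steps

# Each optimizer update consumes accum_steps batches.
optimizer_updates = limit_train_batches * max_epochs // accum_steps
print(optimizer_updates)   # 8000 optimizer updates for the whole run
```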

youwyu · Nov 03 '25 22:11

> Hello, have you found a solution? I'm also curious how your training was conducted with
>
> limit_train_batches: 800
> limit_val_batches: 100
>
> This setting is only for testing, which means not enough gradient steps are run.

Hello, I also noticed this issue. But after setting limit_train_batches to null, the training will proceed according to len_train, right? The default value of 100000 is really too large.

Shexiaox · Nov 04 '25 09:11