VGGT Finetuning Issue on VKITTI Datasets
Hi, thank you for open-sourcing this awesome work! We are trying to finetune VGGT (starting from the original checkpoint) on the VKITTI dataset on a single NVIDIA RTX 6000 Ada (48 GB), but the output visualization from demo_viser shows duplicated, layered point clouds (see the attached screenshots). We are finetuning with the aggregator frozen.
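Before launching training, we double-check which modules actually stay trainable once the aggregator is frozen. This is a minimal sketch of our own (not part of the repo); it simply emulates the freezing that the trainer applies via optim.frozen_module_names:

from vggt.models.vggt import VGGT

# Build the model with the same heads as in the config below
# (random init is fine for a parameter count; no checkpoint needed here).
model = VGGT(enable_camera=True, enable_depth=True, enable_point=True, enable_track=True)

# Emulate optim.frozen_module_names: freeze everything under the aggregator.
for name, param in model.named_parameters():
    if name.startswith("aggregator"):
        param.requires_grad = False

# Count trainable parameters per top-level module (i.e. the prediction heads).
trainable = {}
for name, param in model.named_parameters():
    if param.requires_grad:
        top = name.split(".")[0]
        trainable[top] = trainable.get(top, 0) + param.numel()

for module_name, count in sorted(trainable.items()):
    print(f"{module_name}: {count / 1e6:.1f}M trainable params")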
Here's our default.yaml:
defaults:
- self
- default_dataset.yaml
exp_name: exp010_ft_og_ckpt_corrected
img_size: 518
num_workers: 8
seed_value: 42
accum_steps: 2  # Gradient accumulation was not used in the original training, but you can try it if you run into OOM.
patch_size: 14
val_epoch_freq: 1000000000
max_img_per_gpu: 48
limit_train_batches: 800
limit_val_batches: 100
data:
  train:
    _target_: data.dynamic_dataloader.DynamicTorchDataset
    num_workers: ${num_workers}
    max_img_per_gpu: ${max_img_per_gpu}
    common_config:
      img_size: ${img_size}
      patch_size: ${patch_size}
      debug: True
      repeat_batch: False
    dataset:
      _target_: data.composed_dataset.ComposedDataset
      dataset_configs:
        - _target_: data.datasets.vkitti.VKittiDataset
          split: train
          VKitti_DIR: /home/sbangal4/vggt_new/vggt/data/vkitti/vkitti
  val:
    _target_: data.dynamic_dataloader.DynamicTorchDataset
    num_workers: ${num_workers}
    max_img_per_gpu: ${max_img_per_gpu}
    common_config:
      img_size: ${img_size}
      patch_size: ${patch_size}
      debug: True
    dataset:
      _target_: data.composed_dataset.ComposedDataset
      dataset_configs:
        - _target_: data.datasets.vkitti.VKittiDataset
          split: train
          VKitti_DIR: /home/sbangal4/vggt_new/vggt/data/vkitti/vkitti
logging:
  log_dir: logs
  log_visuals: False
  log_freq: 1
  log_level_primary: DEBUG
  log_level_secondary: WARNING
  all_ranks: False
  tensorboard_writer:
    _target_: train_utils.tb_writer.TensorBoardLogger
    path: ${logging.log_dir}/tensorboard
  scalar_keys_to_log:
    train:
      keys_to_log:
        - loss_objective
        - loss_camera
        - loss_T
        - loss_R
        - loss_FL
        - loss_conf_depth
        - loss_reg_depth
        - loss_grad_depth
    val:
      keys_to_log:
        - loss_objective
        - loss_camera
        - loss_T
        - loss_R
        - loss_FL
        - loss_conf_depth
        - loss_reg_depth
        - loss_grad_depth
checkpoint:
  save_dir: logs/${exp_name}/ckpts
  save_freq: 5
  resume_checkpoint_path: /home/sbangal4/vggt_new/vggt/checkpoint/model.pt
  strict: False
loss:
  _target_: loss.MultitaskLoss
  camera:
    weight: 5.0
    loss_type: "l1"  # The paper uses smooth l1 loss, but we found l1 loss is more stable than smooth l1 and l2 loss.
  depth:
    weight: 1.0
    gradient_loss_fn: "grad"
    valid_range: 0.98
  # point: null
  # If you want to enable point, use the following config
  point:
    weight: 1.0
    gradient_loss_fn: "normal"
    valid_range: 0.98
  track: null
optim:
  param_group_modifiers: False
  optimizer:
    _target_: torch.optim.AdamW
    lr: 1e-6
    weight_decay: 0.05
  frozen_module_names:
    - ["aggregator"]
  amp:
    enabled: True
    amp_dtype: bfloat16
  gradient_clip:
    _target_: train_utils.gradient_clip.GradientClipper
    configs:
      - module_name: [""]
        params: [".*"]
        max_norm: 1.0
        norm_type: 2
  options:
    lr:
      - scheduler:
          _target_: fvcore.common.param_scheduler.CompositeParamScheduler
          schedulers:
            - _target_: fvcore.common.param_scheduler.LinearParamScheduler
              start_value: 1e-8
              end_value: 5e-5
            - _target_: fvcore.common.param_scheduler.CosineParamScheduler
              start_value: 5e-5
              end_value: 1e-8
          lengths: [0.05, 0.95]
          interval_scaling: ['rescaled', 'rescaled']
    weight_decay:
      - scheduler:
          _target_: fvcore.common.param_scheduler.ConstantParamScheduler
          value: 0.05
max_epochs: 20
model:
  _target_: vggt.models.vggt.VGGT
  enable_camera: True
  enable_depth: True
  enable_point: True
  enable_track: True
distributed:
  backend: nccl
  comms_dtype: None
  find_unused_parameters: True
  timeout_mins: 30
  gradient_as_bucket_view: True  # Less memory used
  bucket_cap_mb: 25
  broadcast_buffers: True
cuda:
  cudnn_deterministic: False
  cudnn_benchmark: False
  allow_tf32: True
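On the duplicated geometry: a cheap check after finetuning is whether the point head still agrees with points obtained by unprojecting the depth head through the predicted cameras; if the two branches drift apart (or per-frame predictions stop being registered to a common frame), the visualization can show offset copies of the scene. Below is a rough diagnostic sketch following the inference API from the repo README; the checkpoint path and frame names are placeholders, this is not necessarily the root cause, and exact tensor shapes may need small adjustments:

import numpy as np
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images
from vggt.utils.pose_enc import pose_encoding_to_extri_intri
from vggt.utils.geometry import unproject_depth_map_to_point_map

device = "cuda"
model = VGGT(enable_camera=True, enable_depth=True, enable_point=True, enable_track=True).to(device)

# Placeholder checkpoint path: point this at the finetuned weights written by the trainer.
ckpt = torch.load("logs/exp010_ft_og_ckpt_corrected/ckpts/checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt.get("model", ckpt), strict=False)
model.eval()

# Placeholder frames from one VKITTI sequence.
images = load_and_preprocess_images(["frame_000.png", "frame_001.png", "frame_002.png"]).to(device)

with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    preds = model(images)

# Recover cameras from the pose encoding, then unproject the predicted depth maps.
extrinsic, intrinsic = pose_encoding_to_extri_intri(preds["pose_enc"], images.shape[-2:])
depth = preds["depth"].squeeze(0).float().cpu().numpy()                    # (S, H, W, 1)
points_from_depth = unproject_depth_map_to_point_map(
    depth,
    extrinsic.squeeze(0).float().cpu().numpy(),
    intrinsic.squeeze(0).float().cpu().numpy(),
)
points_from_head = preds["world_points"].squeeze(0).float().cpu().numpy()  # (S, H, W, 3)

# A large gap means the depth+camera branch and the point branch no longer agree.
print("mean |point head - unprojected depth|:",
      np.abs(points_from_head - np.asarray(points_from_depth)).mean())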
Hello, have you found a solution? I'm also curious how your training is conducted with
limit_train_batches: 800
limit_val_batches: 100
This setting is only meant for quick testing, which means not enough gradient steps are run.
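For a rough sense of scale, assuming limit_train_batches caps the batches per epoch and accum_steps folds consecutive batches into one optimizer update, the posted config comes out to only a few thousand updates:

# Back-of-the-envelope update count for the posted config
# (assumes limit_train_batches caps batches per epoch and accum_steps
# folds that many batches into a single optimizer update).
limit_train_batches = 800
accum_steps = 2
max_epochs = 20

updates_per_epoch = limit_train_batches // accum_steps   # 400
total_updates = updates_per_epoch * max_epochs            # 8,000
print(f"{total_updates} optimizer updates over {max_epochs} epochs")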
Hello, I also noticed this issue. But after setting limit_train_batches to null, training proceeds according to len_train, right? The default value of 100000 is really too large.