Call for suggestions on a novel view synthesis (NVS) downstream task
Hi, great work!
The point cloud output by the pretrained VGGT model is geometrically quite accurate, especially the points derived from camera_head + depth_head.
However, the output point cloud lacks a clear physical scale (e.g., metric units), because you apply normalization to the GT point cloud during training.
Here, I want to train an NVS downstream network on top of your pretrained VGGT weights (perhaps using 3DGS for the rendering procedure). I have metric camera information (K, R, T) and the corresponding RGB images, but no GT depth/point cloud for supervision. So I want to train with a novel-view RGB loss only.
My data are also significantly smaller and less diverse than your pretraining dataset, particularly in terms of camera viewpoints and sample count.
Is it feasible to fine-tune the full pretrained VGGT weights to output metric KRT, depths, and world points, supervised only by a novel-view rgb_loss and a GT camera_loss?
I would appreciate any suggestions.
Best wishes, VillardX
Hi,
Yes, I believe it's possible. For fine-tuning, you can use a photometric loss and a camera pose loss based on multi-view consistency. For example, given several images, you first predict depth maps and camera poses. Then you unproject the depth maps into 3D points and reproject them into other views using the estimated poses. This lets you warp one image into another view and use the color difference as supervision.
Keep in mind that this approach assumes a strictly rigid scene; any non-rigid motion will violate multi-view consistency and degrade performance.
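For concreteness, here is a minimal sketch of such a photometric warping loss in PyTorch. It is illustrative only: the pose convention (R, t maps target-camera coordinates into the source camera), the L1 color difference, and the validity masking are assumptions, not an exact recipe from VGGT.

```python
import torch
import torch.nn.functional as F

def photometric_warp_loss(img_src, img_tgt, depth_tgt, K, R, t):
    """Warp img_src into the target view using depth_tgt and the relative
    pose (R, t: target camera -> source camera), then compare colors.
    Shapes: img_* (B,3,H,W), depth_tgt (B,1,H,W), K/R (B,3,3), t (B,3)."""
    B, _, H, W = img_tgt.shape
    device = img_tgt.device

    # Homogeneous pixel grid of the target view, shape (B, 3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)
    pix = pix.reshape(1, 3, -1).expand(B, -1, -1)

    # Unproject target pixels to 3D camera points using the predicted depth
    cam_pts = torch.linalg.inv(K) @ pix * depth_tgt.reshape(B, 1, -1)

    # Rigidly transform into the source camera and project with K
    src_pts = K @ (R @ cam_pts + t.reshape(B, 3, 1))
    uv = src_pts[:, :2] / src_pts[:, 2:3].clamp(min=1e-6)

    # Normalize pixel coordinates to [-1, 1] and sample the source image
    u = uv[:, 0] / (W - 1) * 2 - 1
    v = uv[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    warped = F.grid_sample(img_src, grid, padding_mode="border",
                           align_corners=True)

    # Ignore points that land behind the source camera
    valid = (src_pts[:, 2:3].reshape(B, 1, H, W) > 0).float()
    diff = (warped - img_tgt).abs().mean(dim=1, keepdim=True)
    return (valid * diff).sum() / valid.sum().clamp(min=1.0)
```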
Thanks for the reply. My NVS dataset does assume a strictly rigid scene. So in your opinion, is it better to fine-tune VGGT based on the camera_head + depth_head output, instead of the point_head output?
Actually, I tried fine-tuning the full weights of the aggregator and point_head in VGGT, directly taking predictions["world_points"] as my Gaussian renderer's xyz input (the definition of predictions["world_points"] is intuitive: each pixel's 3D position in the coordinate system of the first input frame). However, the NVS RGB loss diverged quickly and the NVS results showed only "bg_color".
However, when I only unfreeze camera_head and use camera_head + depth_head to construct the Gaussian renderer's xyz input, the loss decreases as expected, and the PSNR reaches about 30.
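For context, this is roughly how I construct the renderer's xyz in that setting, using the unprojection helpers shown in the VGGT README. It is a sketch of my setup, not exact code; note that if unproject_depth_map_to_point_map runs in NumPy internally, a differentiable torch equivalent is needed for gradients to reach camera_head.

```python
import torch
from vggt.utils.pose_enc import pose_encoding_to_extri_intri
from vggt.utils.geometry import unproject_depth_map_to_point_map

# Forward pass; only camera_head is unfrozen, everything else keeps
# the pretrained weights
predictions = model(images)  # images: (1, S, 3, H, W)

# Decode camera_head's pose encoding into per-frame extrinsics/intrinsics
extrinsic, intrinsic = pose_encoding_to_extri_intri(
    predictions["pose_enc"], images.shape[-2:]
)

# Unproject depth_head's depth maps into world points in the first
# frame's coordinate system; this replaces predictions["world_points"]
# as the Gaussian renderer's xyz input
xyz = unproject_depth_map_to_point_map(
    predictions["depth"].squeeze(0),
    extrinsic.squeeze(0),
    intrinsic.squeeze(0),
)  # (S, H, W, 3)
```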
Regarding Sec. 4.6 of the paper: did you train LVSM with all VGGT weights frozen? It seems that fine-tuning the full VGGT weights leads to catastrophic forgetting.
Hello, I saw similar ideas in a recent method called AnySplat. They used a frozen backbone initialized with VGGT weights, jointly trained the Gaussian head, camera head, and depth head, and utilized another pre-trained VGGT model to generate pseudo-ground truth for supervision. Maybe you can check their paper for some details.
Thanks for the advice! I will check it later.
Could you tell me: if camera_head is unfrozen, how do you supervise the camera, and is normalization still required? In addition, my understanding is that when unprojecting via camera + depth, the depth is the normalized value.
When I fine-tuned VGGT with camera_head + depth_head, I provided the ground-truth (GT) KRT camera information for both source and target views. I used the GT source KRT to supervise camera_head. Since the GT source KRT is metric, camera_head and depth_head are expected to output metric KRT and metric depth, driven by the camera loss on the source views and the NVS RGB loss computed from the source views' predicted KRT and depth together with the provided GT target KRT.
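A compact sketch of that supervision, in case it helps. The pose-encoding helpers are from vggt.utils.pose_enc; render_gaussians stands in for my 3DGS rendering pipeline and is hypothetical, as is the Smooth-L1 choice for the camera loss:

```python
import torch.nn.functional as F
from vggt.utils.pose_enc import (
    extri_intri_to_pose_encoding,
    pose_encoding_to_extri_intri,
)

def training_step(model, images, gt_src_extri, gt_src_intri,
                  gt_tgt_extri, gt_tgt_intri, gt_tgt_rgb,
                  render_gaussians, lambda_cam=1.0):
    """images: (1, S, 3, H, W) source views; gt_* are metric GT KRT.
    render_gaussians is a placeholder for the 3DGS renderer."""
    predictions = model(images)

    # Camera loss: compare the predicted pose encoding with the encoding
    # of the metric GT source KRT -- this is what anchors the scale
    gt_pose_enc = extri_intri_to_pose_encoding(
        gt_src_extri, gt_src_intri, images.shape[-2:]
    )
    camera_loss = F.smooth_l1_loss(predictions["pose_enc"], gt_pose_enc)

    # NVS loss: build Gaussians from the (now metric) predicted depth and
    # KRT, render into the GT target view, and compare RGB
    pred_extri, pred_intri = pose_encoding_to_extri_intri(
        predictions["pose_enc"], images.shape[-2:]
    )
    rendered = render_gaussians(predictions["depth"], pred_extri, pred_intri,
                                gt_tgt_extri, gt_tgt_intri)
    rgb_loss = F.l1_loss(rendered, gt_tgt_rgb)

    return rgb_loss + lambda_cam * camera_loss
```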