Directly input camera params into the model
Hi, thanks for this amazing work!
I am Gongjie from the robotics lab of Alibaba DAMO Academy.
I wonder if camera parameters can be directly injected into the model instead of relying on its predictions. In many practical settings, such as robotics and autonomous driving, camera parameters are easy to obtain, and I think directly injecting accurate params could help a lot.
Do you have any advice on achieving this? Thanks a lot. :)
Hi @ZhangGongjie,
Thanks for the interest! Yes, we are testing this feature, and it should be included in our next version. Basically, it requires fine-tuning the model to accommodate such an input. A straightforward idea is to use something similar to DiT (https://github.com/facebookresearch/DiT).
In our own experiments, fine-tuning typically takes 1–2 days and already shows decent results.
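For concreteness, here is a rough sketch of one way to embed known intrinsics and extrinsics into a conditioning vector, analogous to DiT's timestep/class embedding. All names and shapes below are hypothetical, not part of the released code:

```python
import torch
import torch.nn as nn

class CameraEmbed(nn.Module):
    """Hypothetical sketch: map known camera parameters to a conditioning
    vector that can modulate the transformer blocks, DiT-style."""

    def __init__(self, dim: int):
        super().__init__()
        # 4 intrinsics (fx, fy, cx, cy; ideally normalized by image size)
        # + 12 extrinsics (flattened 3x4 [R|t]) = 16 raw values.
        self.mlp = nn.Sequential(
            nn.Linear(16, dim),
            nn.SiLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, K: torch.Tensor, Rt: torch.Tensor) -> torch.Tensor:
        # K: (B, 3, 3) intrinsics; Rt: (B, 3, 4) world-to-camera pose
        intr = torch.stack(
            [K[:, 0, 0], K[:, 1, 1], K[:, 0, 2], K[:, 1, 2]], dim=-1
        )  # (B, 4)
        cond = torch.cat([intr, Rt.flatten(1)], dim=-1)  # (B, 16)
        return self.mlp(cond)  # feed into adaLN-style shift/scale modulation
```

The resulting vector would then condition each block (e.g., regressing per-channel shift/scale as in DiT's adaLN), and fine-tuning adapts the pretrained weights to the new input.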
Awesome!! 👍 👍 👍 May I know how many GPU hours the fine-tuning took?
It was trained on 64 GPUs for 1–2 days, so roughly 1.5k–3k GPU hours.
I am really looking forward to this feature!
Have any relevant code or model weights been released yet?
May I ask whether using known camera intrinsics and poses is supported now?
Hey @jytime, is this feature still being worked on? It would be really helpful for a ton of robotics applications.
Still waiting. May I ask approximately when it will be released?
Hi @jytime, following up on your suggestion about "using something similar to DiT": are you proposing replacing the LayerNorm in each Block of VGGT's Aggregator with DiT's AdaLN-Zero? I tried this modification in some initial experiments, but it resulted in a very high loss during the early stages of training. Is this expected behavior? Could you advise on where improvements might be needed (e.g., architecture adjustments), or whether the training settings (like the loss function or learning rate) should be tuned?
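For reference, my modification looks roughly like the sketch below (placeholder names, not VGGT's actual classes): each block's LayerNorm is replaced by an AdaLN-Zero modulation regressed from a camera embedding c.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Placeholder sketch of a pre-norm transformer block with
    AdaLN-Zero conditioning on a camera embedding c."""

    def __init__(self, dim: int, num_heads: int, cond_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Shift/scale/gate for both sub-blocks: 6 * dim values.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 6 * dim))
        # As in DiT, the modulation layer is zero-initialized so every
        # gate starts at 0 and each residual branch is a no-op at init.
        nn.init.zeros_(self.ada[1].weight)
        nn.init.zeros_(self.ada[1].bias)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens; c: (B, cond_dim) camera embedding
        s1, b1, g1, s2, b2, g2 = self.ada(c).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        x = x + g2 * self.mlp(h)
        return x
```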