Directly input camera params into the model

Open ZhangGongjie opened this issue 8 months ago • 5 comments

Hi, thanks for this amazing work!

I am Gongjie from the robotics lab of Alibaba DAMO Academy.

I wonder if camera parameters can be directly injected into the model instead of relying on the predictions. In many practical cases, such as robotics / autonomous driving, camera parameters are pretty easy to obtain. I think directly injecting accurate params could help a lot.

Do you have any advice on achieving this? Thanks a lot. :)

ZhangGongjie avatar Apr 22 '25 03:04 ZhangGongjie

Hi @ZhangGongjie,

Thanks for the interest! Yes, we are testing this feature, and it should be included in our next version. Basically, it requires fine-tuning the model to accommodate such an input. A straightforward idea is to use something similar to DiT (https://github.com/facebookresearch/DiT).

In our own experiments, fine-tuning typically takes 1–2 days and already shows decent results.

jytime avatar Apr 23 '25 21:04 jytime
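
The reply above does not spell out how the camera parameters would enter the network. As a rough illustration only, DiT-style conditioning usually starts by mapping the conditioning signal to an embedding vector with a small MLP. The sketch below assumes a hypothetical 9-dimensional per-frame camera vector (for example translation, rotation quaternion, and field of view); the class name `CameraEmbedder` and all dimensions are illustrative and are not part of the VGGT codebase.

```python
import torch
import torch.nn as nn


class CameraEmbedder(nn.Module):
    """Hypothetical module: maps a per-frame camera vector to a conditioning embedding."""

    def __init__(self, cam_dim: int = 9, embed_dim: int = 64):
        super().__init__()
        # Two-layer MLP, similar in spirit to DiT's timestep embedder
        self.mlp = nn.Sequential(
            nn.Linear(cam_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, cam: torch.Tensor) -> torch.Tensor:
        # cam: (batch, frames, cam_dim) -> (batch, frames, embed_dim)
        return self.mlp(cam)


batch, frames = 2, 4
cam = torch.randn(batch, frames, 9)  # placeholder camera vectors, not real data
cond = CameraEmbedder(cam_dim=9, embed_dim=64)(cam)
print(cond.shape)
```

The resulting per-frame embedding could then drive the modulation layers of each transformer block; how the released version does this, if at all, is not stated in this thread.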

Awesome!! 👍 👍 👍 May I know roughly how many GPU hours the fine-tuning took?

ZhangGongjie avatar Apr 24 '25 09:04 ZhangGongjie

It was trained on 64 GPUs for 1–2 days, so approximately 1.5k–3k GPU hours (64 GPUs × 24–48 hours).

jytime avatar Apr 24 '25 21:04 jytime
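
Taking the figures in the reply above at face value (64 GPUs running continuously for 1 to 2 days), the GPU-hour budget can be worked out as below. The helper name `gpu_hours` is just for illustration, and the real figure would be lower if the GPUs were not busy for the whole period.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """GPU hours consumed by num_gpus devices running continuously for the given days."""
    return num_gpus * days * 24


low = gpu_hours(64, 1)   # one full day on 64 GPUs
high = gpu_hours(64, 2)  # two full days on 64 GPUs
print(f"{low:.0f} to {high:.0f} GPU hours")
```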

I am really looking forward to this feature!
Has any relevant code or model weights been released yet?

DingChunQ avatar May 12 '25 19:05 DingChunQ

May I ask whether using known camera intrinsics and poses is now supported?

missTL avatar May 30 '25 07:05 missTL

Hey @jytime, is this feature still being worked on? It would be really helpful for a ton of robotics applications.

harshagundala avatar Jul 10 '25 18:07 harshagundala

Still waiting. May I ask approximately when it will be released?

ChambinLee avatar Jul 21 '25 02:07 ChambinLee

Hi @jytime, following up on your suggestion about "using something similar to DiT": are you proposing replacing the LayerNorm in each Block of VGGT's Aggregator with DiT's AdaLN-Zero? I tried this modification in some initial experiments, but it resulted in a very high loss during the early training stages. Is this expected behavior? Could you advise on what might need to change, e.g., the architecture, or training settings such as the loss function or learning rate?

CanvasEngineer avatar Aug 16 '25 16:08 CanvasEngineer
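
One possible factor in the question above, not confirmed by the authors: DiT's AdaLN-Zero zero-initializes the residual gates so that every block starts as the identity, which suits training from scratch. Applied unchanged to a pretrained block, zero gates would switch off the pretrained attention and MLP branches at the start of fine-tuning. A gentler variant, sketched below under that assumption, keeps the pretrained LayerNorm and adds zero-initialized shift and scale terms, so the wrapped layer reproduces the pretrained output exactly at initialization. The name `ModulatedNorm` and the dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class ModulatedNorm(nn.Module):
    """Hypothetical wrapper: pretrained LayerNorm plus zero-initialized conditioning."""

    def __init__(self, norm: nn.LayerNorm, cond_dim: int):
        super().__init__()
        dim = norm.normalized_shape[0]
        self.norm = norm  # keeps the pretrained affine weights
        self.to_shift_scale = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 2 * dim))
        # Zero init: shift = scale = 0 at the start, so the output equals norm(x)
        nn.init.zeros_(self.to_shift_scale[-1].weight)
        nn.init.zeros_(self.to_shift_scale[-1].bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); cond: (batch, cond_dim)
        shift, scale = self.to_shift_scale(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


pretrained_norm = nn.LayerNorm(64)  # stand-in for a norm layer from a pretrained block
wrapped = ModulatedNorm(pretrained_norm, cond_dim=32)

x = torch.randn(2, 10, 64)
cond = torch.randn(2, 32)  # e.g. an embedding of the camera parameters
same_at_init = torch.allclose(wrapped(x, cond), pretrained_norm(x))
print(same_at_init)
```

Because the modulation starts at zero, the fine-tuning loss should begin at the pretrained model's level, and the camera conditioning is learned gradually as the zero-initialized weights move away from zero.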