Directly input camera params into the model
Hi, thanks for this amazing work!
I am Gongjie from the robotics lab of Alibaba DAMO Academy.
I wonder if camera parameters can be directly injected into the model instead of relying on its predictions. In many practical settings, such as robotics and autonomous driving, camera parameters are easy to obtain, and I think directly injecting accurate params could help a lot.
Do you have any advice on achieving this? Thanks a lot. :)
Hi @ZhangGongjie,
Thanks for the interest! Yes, we are testing this feature, and it should be included in our next version. Basically, it requires fine-tuning the model to accommodate such an input. A straightforward idea is to use something similar to DiT (https://github.com/facebookresearch/DiT).
In our own experiments, fine-tuning typically takes 1–2 days and already shows decent results.
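For concreteness, here is a rough sketch of one way to embed known intrinsics and extrinsics into a conditioning vector, analogous to DiT's timestep/class embedding. All names and shapes below are hypothetical, not part of the released code:

```python
import torch
import torch.nn as nn

class CameraEmbed(nn.Module):
    """Hypothetical sketch: map known camera parameters to a conditioning
    vector that can modulate the transformer blocks, DiT-style."""

    def __init__(self, dim: int):
        super().__init__()
        # 4 intrinsics (fx, fy, cx, cy; ideally normalized by image size)
        # + 12 extrinsics (flattened 3x4 [R|t]) = 16 raw values.
        self.mlp = nn.Sequential(
            nn.Linear(16, dim),
            nn.SiLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, K: torch.Tensor, Rt: torch.Tensor) -> torch.Tensor:
        # K: (B, 3, 3) intrinsics; Rt: (B, 3, 4) world-to-camera pose
        intr = torch.stack(
            [K[:, 0, 0], K[:, 1, 1], K[:, 0, 2], K[:, 1, 2]], dim=-1
        )  # (B, 4)
        cond = torch.cat([intr, Rt.flatten(1)], dim=-1)  # (B, 16)
        return self.mlp(cond)  # feed into adaLN-style shift/scale modulation
```

The resulting vector would then condition each block (e.g., regressing per-channel shift/scale as in DiT's adaLN), and fine-tuning adapts the pretrained weights to the new input.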
Awesome!! 👍 👍 👍 May I know how many GPU hours the fine-tuning took?
It was trained on 64 GPUs for 1–2 days, so roughly 1.5k–3k GPU hours.
I am really looking forward to this feature!
Have any relevant code or model weights been released yet?
May I ask whether using known camera intrinsics and poses is supported now?
Hey @jytime, is this feature still being worked on? It would be really helpful for a ton of robotics applications.
Still waiting. May I ask approximately when it will be released?
Hi @jytime, following up on your suggestion about "using something similar to DiT": are you proposing replacing the LayerNorm in each Block of VGGT's Aggregator with DiT's AdaLN-Zero? I tried this modification in some initial experiments, but it resulted in a very high loss during the early stages of training. Is this expected behavior? Could you advise on where improvements might be needed (e.g., architecture adjustments), or whether the training settings (like the loss function or learning rate) should be tuned?
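For reference, my modification looks roughly like the sketch below (placeholder names, not VGGT's actual classes): each block's LayerNorm is replaced by an AdaLN-Zero modulation regressed from a camera embedding c.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Placeholder sketch of a pre-norm transformer block with
    AdaLN-Zero conditioning on a camera embedding c."""

    def __init__(self, dim: int, num_heads: int, cond_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Shift/scale/gate for both sub-blocks: 6 * dim values.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 6 * dim))
        # As in DiT, the modulation layer is zero-initialized so every
        # gate starts at 0 and each residual branch is a no-op at init.
        nn.init.zeros_(self.ada[1].weight)
        nn.init.zeros_(self.ada[1].bias)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens; c: (B, cond_dim) camera embedding
        s1, b1, g1, s2, b2, g2 = self.ada(c).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        x = x + g2 * self.mlp(h)
        return x
```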