
Are camera intrinsics shared between all images?

Open FlorentLM opened this issue 7 months ago • 5 comments

Hi there,

I work on a project that involves motion-capturing ants. I use 5 synchronised Basler cameras to film them.

While I know VGGT is not meant for video data with static cameras and moving subjects, I tried it on a still frame from each of my 5 cameras to see how it'd work.

I get this for intrinsics:

tensor([[[[2.1154e+03, 0.0000e+00, 2.5900e+02],
          [0.0000e+00, 2.1333e+03, 1.9600e+02],
          [0.0000e+00, 0.0000e+00, 1.0000e+00]],
         [[2.2836e+03, 0.0000e+00, 2.5900e+02],
          [0.0000e+00, 2.2914e+03, 1.9600e+02],
          [0.0000e+00, 0.0000e+00, 1.0000e+00]],
         [[2.1045e+03, 0.0000e+00, 2.5900e+02],
          [0.0000e+00, 2.0773e+03, 1.9600e+02],
          [0.0000e+00, 0.0000e+00, 1.0000e+00]],
         [[2.0306e+03, 0.0000e+00, 2.5900e+02],
          [0.0000e+00, 2.0084e+03, 1.9600e+02],
          [0.0000e+00, 0.0000e+00, 1.0000e+00]],
         [[2.3098e+03, 0.0000e+00, 2.5900e+02],
          [0.0000e+00, 2.2900e+03, 1.9600e+02],
          [0.0000e+00, 0.0000e+00, 1.0000e+00]]]], device='cuda:0')

and the principal point values are suspiciously identical.

While I do use the same camera model and the same lens on all 5 cameras, I'd expect the cx and cy values to vary at least slightly?

Here is how the result looks (that's with the Gradio visualiser, using the Pointmap branch, as the Depthmap and Camera branch looks way worse):

Image

For comparison, with another object (a calibration target) I get better results but still not great:

Image

Are camera intrinsics assumed to be the same for all images? Thanks!

Here are the 5 files used in this example https://github.com/user-attachments/assets/f60197a5-bb5f-4bd4-8b48-376cc52c1010 https://github.com/user-attachments/assets/a67ddb25-a06f-4739-8e50-ed37da13aeb5 https://github.com/user-attachments/assets/80687539-edb6-4ce6-b741-3b33d0b68923 https://github.com/user-attachments/assets/3670de65-b29e-48e4-9778-a3fe09ee8d09 https://github.com/user-attachments/assets/7a3a7057-0c79-4179-a8c7-6f389652dd0e

FlorentLM avatar May 16 '25 17:05 FlorentLM

Hi, we do not assume the input images share the same intrinsics. You can see that the predicted focal length is different for each image.

However, the principal point is always assumed to sit at the center of the image. So if your input images are the same size, the principal points will be identical.
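To illustrate the point above, here is a hypothetical sketch (not VGGT's actual code) of how per-image intrinsics would be assembled when only the focal pair is predicted and the principal point is fixed at the image center. For 518x392 inputs, that gives cx = 259 and cy = 196, matching the identical values in the tensor posted earlier.

```python
import numpy as np

def build_intrinsics(focals, image_hw):
    """Build a (N, 3, 3) stack of intrinsic matrices.

    focals: list of (fx, fy) pairs, one per image (predicted per-image).
    image_hw: (H, W) of the input images; the principal point is
    fixed at the image center (W/2, H/2), not predicted.
    """
    H, W = image_hw
    K = np.zeros((len(focals), 3, 3))
    for i, (fx, fy) in enumerate(focals):
        K[i] = [[fx, 0.0, W / 2.0],
                [0.0, fy, H / 2.0],
                [0.0, 0.0, 1.0]]
    return K

# First two focal pairs from the tensor in the question, 518x392 images.
K = build_intrinsics([(2115.4, 2133.3), (2283.6, 2291.4)], image_hw=(392, 518))
```

Note how fx and fy differ per image while every K[i] shares the same last column: that is exactly the pattern visible in the posted tensor.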

For the point cloud visualisation result, can you check whether the ground planes predicted from the different images align correctly? If so, you probably just need to filter out more points by raising the confidence threshold, and the noisy points will go away.
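The confidence filtering suggested above can be sketched like this. The function and parameter names are illustrative, not VGGT's API; it simply keeps the points whose confidence lands above a chosen percentile, which is roughly what the visualiser's threshold slider does.

```python
import numpy as np

def filter_points(points, conf, keep_percent=80.0):
    """Keep only points whose confidence is at or above the given
    percentile of all confidence values.

    points: (N, 3) array of 3D points.
    conf:   (N,) array of per-point confidence scores.
    """
    thresh = np.percentile(conf, keep_percent)
    mask = conf >= thresh
    return points[mask], mask

# Toy example: 1000 random points with random confidences.
rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 3))
conf = rng.uniform(size=1000)
kept, mask = filter_points(pts, conf, keep_percent=80.0)
```

Raising `keep_percent` discards more of the low-confidence (typically noisy) points at the cost of a sparser cloud.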

jytime avatar May 18 '25 20:05 jytime

Thanks for your reply!

However, the principal point is always assumed to stay at the center of images.

Ah ok, fair enough!

For the point cloud visualisation result, can you check if the ground plane predicted from different images match correctly? If so, probably you just need to filter out more points by controlling the conf thres, and the noisy points will go away.

Yes, the plane matches pretty well, but the remaining points don't. Here's the result with an 80% confidence threshold: Image

And here's how it looks with the Depthmap and Camera branch (still at an 80% confidence threshold): Image

FlorentLM avatar May 19 '25 10:05 FlorentLM

Yeah, for the ant example, this is a known issue. Since we included some dynamic datasets in training (e.g., TartanAir, pointo), the model tends to "imagine" that some objects are moving even when they are static. This is a problem of data priors that we hope to fix in VGGT v2. In this case, though, the camera intrinsics should still be correct.

jytime avatar May 19 '25 15:05 jytime

Alright that makes sense, thanks for the info! This is some impressive work :)

FlorentLM avatar May 19 '25 15:05 FlorentLM

@jytime Will VGGTv2 cover dynamic reconstruction? Like DUSt3R -> MONST3R?

YJ-142150 avatar May 20 '25 02:05 YJ-142150