Why is the translation value in the camera extrinsics so small?
Hi, thanks a lot for open-sourcing this great work!
I’m planning to create a dataset similar to RealEstate10K, and I’d like to use VGGT to estimate both camera intrinsics and extrinsics. I ran `python demo_gradio.py` and tested it with some images extracted from a video where the camera slowly moves forward.
My expectation was that the predicted camera frustums would line up in a row, but instead the output frustums almost collapse into a single one. Is this normal?
I also checked the generated `predictions.npz`, and the values under the `extrinsic` key are indeed very small; many of them are on the order of 0.0000x. Is this expected behavior?
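For reference, here is roughly how I inspected the file (a minimal sketch; `extrinsic` is the key I saw in my `predictions.npz`, and I'm assuming it stores one 3×4 `[R|t]` world-to-camera matrix per frame):

```python
import numpy as np

# Load the output written by demo_gradio.py and list its keys.
data = np.load("predictions.npz")
print(data.files)

# Assuming "extrinsic" holds one 3x4 [R|t] matrix per frame,
# the last column of each matrix is the translation t.
extrinsic = data["extrinsic"]                # e.g. shape (num_frames, 3, 4)
translations = extrinsic[:, :, 3]
print(np.linalg.norm(translations, axis=1))  # per-frame |t|, ~0.0000x for me
```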
I’m new to this field, so apologies if my question sounds naive. Thanks!
The images I used for testing: images.zip
I noticed this sentence in the paper: “we compute the average Euclidean distance of all 3D points in the point map P to the origin and use this scale to normalize the camera translations t.”
Can I interpret this as follows: if my camera only moves 20 cm while the average distance from the first-frame origin to all 3D points is 20 meters, then the translation along the z-axis would be normalized to about 0.2/20 = 0.01. However, in my case the predicted values are even smaller, around 0.000x instead of 0.0x.
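To check my arithmetic, here is a toy sketch of that interpretation with synthetic numbers (not VGGT's actual code): points sitting ~20 m from the origin and a camera that moved 0.2 m along z.

```python
import numpy as np

# Synthetic scene: 10k points, each exactly 20 m from the origin.
rng = np.random.default_rng(0)
points = rng.normal(size=(10000, 3))
points *= 20.0 / np.linalg.norm(points, axis=1, keepdims=True)

# Camera translation: 20 cm forward along z.
t = np.array([0.0, 0.0, 0.2])

# Normalization as I read the paper: divide t by the average
# point-to-origin distance.
scale = np.linalg.norm(points, axis=1).mean()   # ~20.0
t_normalized = t / scale
print(scale, t_normalized)                      # z-component ~0.01
```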
Yeah, I think your understanding is correct. As for your example, could you try adjusting the confidence threshold a bit?
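In case it helps, a hedged sketch of what I mean: filter the predicted points with a stricter confidence cutoff before reasoning about scene scale, since low-confidence (often far-away or noisy) points can dominate the average point-to-origin distance. The `world_points` and `conf` key names below are assumptions about what `predictions.npz` contains, so adapt them to your file:

```python
import numpy as np

data = np.load("predictions.npz")
# NOTE: both key names are guesses; check data.files for the real ones.
points = data["world_points"].reshape(-1, 3)
conf = data["conf"].reshape(-1)

# Keep only points above a confidence percentile, then recompute the
# average point-to-origin distance used as the normalization scale.
threshold = np.percentile(conf, 50)   # try raising this
kept = points[conf >= threshold]
print(np.linalg.norm(kept, axis=1).mean())
```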
Hi, did you solve it? @EmmaThompson123