Why is the translation value in the camera extrinsics so small?
Hi, thanks a lot for open-sourcing this great work!
I’m planning to create a dataset similar to RealEstate10K, and I’d like to use VGGT to estimate both camera intrinsics and extrinsics. I ran `python demo_gradio.py` and tested it with some images extracted from a video where the camera slowly moves forward.
My expectation was that the predicted camera frustums would line up in a row, but instead the output frustums almost collapse into a single one. Is this normal?
I also checked the generated `predictions.npz`, and the values under the `extrinsic` key are indeed very small; many of them are on the order of 0.0000x. Is this expected behavior?
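For reference, here is roughly how I inspected the file (a minimal sketch; `extrinsic` is the key I saw in my `predictions.npz`, and I'm assuming it stores one 3×4 `[R|t]` world-to-camera matrix per frame):

```python
import numpy as np

# Load the output written by demo_gradio.py and list its keys.
data = np.load("predictions.npz")
print(data.files)

# Assuming "extrinsic" holds one 3x4 [R|t] matrix per frame,
# the last column of each matrix is the translation t.
extrinsic = data["extrinsic"]                # e.g. shape (num_frames, 3, 4)
translations = extrinsic[:, :, 3]
print(np.linalg.norm(translations, axis=1))  # per-frame |t|, ~0.0000x for me
```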
I’m new to this field, so apologies if my question sounds naive. Thanks!
The images I used for testing: images.zip
I noticed this sentence in the paper: “we compute the average Euclidean distance of all 3D points in the point map P to the origin and use this scale to normalize the camera translations t.”
Can I interpret this as follows: if my camera only moves 20 cm while the average distance from the first-frame origin to all 3D points is 20 meters, then the translation along the z-axis would be normalized to about 0.2/20 = 0.01. However, in my case the predicted values are even smaller, around 0.000x instead of 0.0x.
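To check my arithmetic, here is a toy sketch of that interpretation with synthetic numbers (not VGGT's actual code): points sitting ~20 m from the origin and a camera that moved 0.2 m along z.

```python
import numpy as np

# Synthetic scene: 10k points, each exactly 20 m from the origin.
rng = np.random.default_rng(0)
points = rng.normal(size=(10000, 3))
points *= 20.0 / np.linalg.norm(points, axis=1, keepdims=True)

# Camera translation: 20 cm forward along z.
t = np.array([0.0, 0.0, 0.2])

# Normalization as I read the paper: divide t by the average
# point-to-origin distance.
scale = np.linalg.norm(points, axis=1).mean()   # ~20.0
t_normalized = t / scale
print(scale, t_normalized)                      # z-component ~0.01
```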
Yeah, I think your understanding is correct. As for your example, could you try adjusting the confidence threshold a bit?
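In case it helps, a hedged sketch of what I mean: filter the predicted points with a stricter confidence cutoff before reasoning about scene scale, since low-confidence (often far-away or noisy) points can dominate the average point-to-origin distance. The `world_points` and `conf` key names below are assumptions about what `predictions.npz` contains, so adapt them to your file:

```python
import numpy as np

data = np.load("predictions.npz")
# NOTE: both key names are guesses; check data.files for the real ones.
points = data["world_points"].reshape(-1, 3)
conf = data["conf"].reshape(-1)

# Keep only points above a confidence percentile, then recompute the
# average point-to-origin distance used as the normalization scale.
threshold = np.percentile(conf, 50)   # try raising this
kept = points[conf >= threshold]
print(np.linalg.norm(kept, axis=1).mean())
```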
Hi, did you solve it? @EmmaThompson123