
About denormalization in inference

Open LeryLee opened this issue 5 months ago • 6 comments

VGGT is impressive! I just have one question: during training, "Ground Truth Coordinate Normalization" was applied. Do we need a denormalization step during inference to recover the original coordinates?

LeryLee avatar Jul 26 '25 11:07 LeryLee

Ground Truth Coordinate Normalization. We follow [129] and, first, express all quantities in the coordinate frame of the first camera g1. Then, we compute the average Euclidean distance of all 3D points in the point map P to the origin and use this scale to normalize the camera translations t, the point map P, and the depth map D.
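The normalization quoted above can be sketched in a few lines of NumPy. This is only an illustration of the quoted description, not the repository's training code: the function name `normalize_scene`, the array shapes, and the `eps` guard are assumptions, and any masking of invalid points is left out.

```python
import numpy as np

def normalize_scene(translations, point_map, depth_map, eps=1e-8):
    """Scale a scene so the mean point distance to the origin becomes 1.

    Assumes every input is already expressed in the first camera's frame.
    Hypothetical shapes: translations (S, 3), point_map (S, H, W, 3),
    depth_map (S, H, W).
    """
    # Average Euclidean distance of all 3D points to the origin.
    scale = np.linalg.norm(point_map.reshape(-1, 3), axis=1).mean()
    scale = max(scale, eps)  # guard against a degenerate all-zero point map
    # The same scalar divides translations, points, and depths.
    return translations / scale, point_map / scale, depth_map / scale, scale
```

Because a single scalar divides all three quantities, undoing the normalization only requires multiplying by that scalar, which is exactly the value that is unknown at inference time.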

LeryLee avatar Jul 26 '25 11:07 LeryLee

Hi, I went through many of the issues in this repository and found that the model predicts only a normalized scene with normalized coordinates, not the actual coordinates. So I have another question: could I train the model without "Ground Truth Coordinate Normalization", so that the actual (or preprocessed) coordinates from my dataloader, also without "Ground Truth Coordinate Normalization", can be used directly at inference? What effect would this have? For example, could it make training unstable? Thanks!

LeryLee avatar Jul 28 '25 02:07 LeryLee

Hi LeryLee, I had the same question. Did you find an answer to it?

sankalpkallakuri avatar Jul 28 '25 05:07 sankalpkallakuri

Hi LeryLee, I had the same question. Did you find an answer to it?

Not yet. I am still trying to figure out how to recover the original coordinates at inference when the camera extrinsics and intrinsics are already known. If that is not possible, I would like to know whether the model can be trained without "Ground Truth Coordinate Normalization".
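One possible way to approach the recovery, when metric extrinsics are known for the same views, is to fit a single scale factor between the predicted and the known camera translations. The sketch below is an assumption and not something the VGGT authors confirm in this thread: it assumes 3x4 camera-from-world matrices [R | t], and the helper names `to_first_camera_frame` and `estimate_scale` are hypothetical.

```python
import numpy as np

def to_first_camera_frame(extrinsics):
    """Re-express (S, 3, 4) camera-from-world [R | t] matrices relative to camera 0."""
    num_views = extrinsics.shape[0]
    E = np.tile(np.eye(4), (num_views, 1, 1))
    E[:, :3, :] = extrinsics
    # camera_i-from-camera_0 = (camera_i-from-world) @ (world-from-camera_0)
    return (E @ np.linalg.inv(E[0]))[:, :3, :]

def estimate_scale(pred_extrinsics, known_extrinsics, eps=1e-12):
    """Least-squares scale s minimizing sum ||s * t_pred - t_known||^2."""
    t_pred = to_first_camera_frame(pred_extrinsics)[:, :, 3]
    t_known = to_first_camera_frame(known_extrinsics)[:, :, 3]
    return float((t_pred * t_known).sum() / max((t_pred ** 2).sum(), eps))
```

Under these assumptions, multiplying the predicted point map, depth map, and translations by the estimated scale would bring them to the units of the known extrinsics. The estimate needs at least two views with a non-zero baseline; with a purely rotating camera the scale cannot be observed from translations.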

LeryLee avatar Jul 28 '25 06:07 LeryLee

How can we use this model to obtain the camera's intrinsic and extrinsic parameters?

TDyyds6 avatar Jul 30 '25 02:07 TDyyds6

How can we use this model to obtain the camera's intrinsic and extrinsic parameters?

I noticed that demo_viser.py computes the camera extrinsics and intrinsics from the predicted pose encoding:

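from vggt.utils.pose_enc import pose_encoding_to_extri_intri

# Decode the predicted pose encoding; images.shape[-2:] is the (H, W) image size.
extrinsic, intrinsic = pose_encoding_to_extri_intri(predictions["pose_enc"], images.shape[-2:])
predictions["extrinsic"] = extrinsic
predictions["intrinsic"] = intrinsic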
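As a small follow-up to the snippet above, here is a hedged sketch of how the decoded extrinsics could be turned into camera positions. It assumes the extrinsics are 3x4 camera-from-world matrices [R | t] and that they have already been converted to a NumPy array of shape (S, 3, 4) with the batch dimension dropped; the helper name `camera_centers` is hypothetical and not part of the repository.

```python
import numpy as np

def camera_centers(extrinsics):
    """Camera centers C = -R^T t for (S, 3, 4) camera-from-world [R | t] matrices."""
    R = extrinsics[:, :, :3]
    t = extrinsics[:, :, 3]
    # (R^T t)_i = sum_j R[j, i] * t[j], evaluated per view.
    return -np.einsum("sji,sj->si", R, t)
```

The resulting centers live in the same normalized frame as the predictions, so their distances are only meaningful up to the unknown scene scale discussed earlier in the thread.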

heMinger avatar Oct 21 '25 09:10 heMinger