
Support for absolute scale using GT poses and intrinsics?

Open booker-max opened this issue 8 months ago • 18 comments

Congratulations on this excellent work! The generalization capability is impressive, especially with the outstanding performance in outdoor scenarios.

Regarding my dataset, I have ground truth intrinsics and poses, and I aim to obtain depth maps and point clouds with absolute scale. I have two potential approaches in mind:

1. Is it possible to incorporate these two parameters (ground truth intrinsics and poses) into the model to directly obtain depth and point clouds with absolute scale?

2. Alternatively, we could align the estimated poses with the ground truth poses to obtain a scaling factor, which could then be applied to the estimated depth to achieve absolute-scale depth maps and point clouds.
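As a rough sketch of the second approach (variable names here are illustrative, not part of VGGT's API): recover a single global scale by comparing camera-center distances between the estimated and ground-truth trajectories, then multiply the predicted depth by that factor.

```python
import numpy as np

def scale_from_poses(est_centers: np.ndarray, gt_centers: np.ndarray) -> float:
    """est_centers, gt_centers: (N, 3) camera centers in the same frame order."""
    # Distances between consecutive camera centers depend on scale but are
    # invariant to each trajectory's global rotation and translation.
    d_est = np.linalg.norm(np.diff(est_centers, axis=0), axis=1)
    d_gt = np.linalg.norm(np.diff(gt_centers, axis=0), axis=1)
    # The median ratio is robust to a few badly estimated poses.
    return float(np.median(d_gt / np.clip(d_est, 1e-9, None)))

# metric_depth = scale_from_poses(est_centers, gt_centers) * predicted_depth
```

Multiplying the predicted depth maps and point clouds by this one scalar is enough, since the model's output is consistent up to a single global scale.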

I would greatly appreciate your suggestions on these approaches. Thank you!

booker-max avatar Mar 27 '25 16:03 booker-max

Hi @booker-max ,

Your proposed solutions look great. I would prefer the second option, as it does not require finetuning the model. For the first one, you would likely need to finetune the model with the camera information as a condition.

jytime avatar Mar 27 '25 21:03 jytime

I have the same idea as you. But I would prefer an end-to-end approach to take advantage of its real-time performance. As far as I know, currently only dust3r supports GT poses and intrinsics. They have another work, pow3r; maybe that will help you.

Livioni avatar Mar 28 '25 03:03 Livioni

> Your proposed solutions look great. I would prefer the second option as it does not need to finetune a model. For the first one, you may need to finetune the model with camera information as condition.

Thank you, I'll go try it.

booker-max avatar Mar 28 '25 09:03 booker-max

> I also have the same idea as you now. But I prefer an end-to-end approach to take advantage of its real-time performance. As far as I know, currently only dust3r supports GT poses and intrinsics. They have another work, pow3r. Maybe this will help you.

Okay, thank you, I'll check out pow3r. By the way, could you tell me how you are currently obtaining absolute-scale depth maps and point clouds?

booker-max avatar Mar 28 '25 09:03 booker-max

> I also have the same idea as you now. But I prefer an end-to-end approach to take advantage of its real-time performance. As far as I know, currently only dust3r supports GT poses and intrinsics. They have another work, pow3r. Maybe this will help you.

> Okay, thank you. I'll go check out pow3r, and by the way, could you tell me how you are currently obtaining absolute-scale depth maps and point clouds?

Hi, I ran into the same problem — have you solved this issue yet?

shiyao-li avatar Apr 11 '25 00:04 shiyao-li

> I also have the same idea as you now. But I prefer an end-to-end approach to take advantage of its real-time performance. As far as I know, currently only dust3r supports GT poses and intrinsics. They have another work, pow3r. Maybe this will help you.

Hi, from the dust3r paper, I don't think they support GT poses and intrinsics, because they normalize the points. Is my understanding correct?
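As a tiny numeric illustration of the normalization point above (an editorial sketch, not dust3r's actual code): once a pointmap is divided by its mean point norm, two scenes that differ only by a metric scale become indistinguishable, so the metric scale cannot be recovered from the normalized prediction alone.

```python
import numpy as np

pts_small = np.array([[0., 0., 1.], [1., 0., 2.], [0., 1., 3.]])
pts_large = 10.0 * pts_small   # the same scene at 10x the metric size

def normalize_pointmap(pts: np.ndarray) -> np.ndarray:
    # Divide by the mean point norm, as in dust3r-style scale normalization.
    return pts / np.linalg.norm(pts, axis=1).mean()

print(np.allclose(normalize_pointmap(pts_small), normalize_pointmap(pts_large)))  # True
```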

RunsenXu avatar May 15 '25 17:05 RunsenXu

> I also have the same idea as you now. But I prefer an end-to-end approach to take advantage of its real-time performance. As far as I know, currently only dust3r supports GT poses and intrinsics. They have another work, pow3r. Maybe this will help you.

> Hi, from the dust3r paper, I don't think they support GT poses and intrinsics, because they normalize the points. Is my understanding correct?

Hi, please check this issue; it may be helpful for you.

Livioni avatar May 16 '25 03:05 Livioni

May I ask whether it is now supported to use the known camera intrinsics and poses?

missTL avatar May 30 '25 07:05 missTL

> Congratulations on this excellent work! The generalization capability is impressive, especially with the outstanding performance in outdoor scenarios.
>
> Regarding my dataset, I have ground truth intrinsics and poses, and I aim to obtain depth maps and point clouds with absolute scale. I have two potential approaches in mind:
>
> 1. Is it possible to incorporate these two parameters (ground truth intrinsics and poses) into the model to directly obtain depth and point clouds with absolute scale?
> 2. Alternatively, we could align the estimated poses with the ground truth poses to obtain a scaling factor, which could then be applied to the estimated depth to achieve absolute-scale depth maps and point clouds.
>
> I would greatly appreciate your suggestions on these approaches. Thank you!

Hi, we are also looking into the same problem and also used the second method. May I ask what scaling factor you got? I was wondering whether it is consistent across all samples; do you have any findings about that?

Much appreciated!

yyypsycheguy avatar Jun 11 '25 15:06 yyypsycheguy

> I also have the same idea as you now. But I prefer an end-to-end approach to take advantage of its real-time performance. As far as I know, currently only dust3r supports GT poses and intrinsics. They have another work, pow3r. Maybe this will help you.

> Hi, from the dust3r paper, I don't think they support GT poses and intrinsics, because they normalize the points. Is my understanding correct?

After carefully reading Dust3R's alignment code, I found that it does not actually use the GT pose. Similar to the method above, it also aligns the estimated pose with the actual pose.

Zhaoyibinn avatar Jun 12 '25 02:06 Zhaoyibinn

> Congratulations on this excellent work! The generalization capability is impressive, especially with the outstanding performance in outdoor scenarios. Regarding my dataset, I have ground truth intrinsics and poses, and I aim to obtain depth maps and point clouds with absolute scale. Is it possible to incorporate these two parameters into the model to directly obtain depth and point clouds with absolute scale? Alternatively, we could align the estimated poses with the ground truth poses to obtain a scaling factor, which could then be applied to the estimated depth.

> Hi, we are also looking into the same problem and also used the second method. May I ask what scaling factor you got? I was wondering whether it is consistent across all samples; do you have any findings about that? Much appreciated!

I used an approach similar to Dust3R's to register the camera poses output by the model to the actual camera poses, and applied the resulting rigid transformation to the cameras and the point cloud. I found that when the camera intrinsics are estimated accurately I can achieve good alignment, but when the estimated intrinsics (focal length) are inaccurate, there is some deviation in the point cloud alignment. My modifications are here: https://github.com/Zhaoyibinn/vggt

Do you have any suggestions? @jytime
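The registration step described above can be sketched along the following lines (a hedged, Umeyama-style similarity fit; names are illustrative and this is not the code from the linked fork):

```python
import numpy as np

def umeyama(src: np.ndarray, dst: np.ndarray):
    """Fit (s, R, t) minimizing ||dst - (s * src @ R.T + t)||; src, dst are (N, 3)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)           # cross-covariance of the two clouds
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                          # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (src_c ** 2).sum() * len(src)
    t = mu_d - s * R @ mu_s
    return s, R, t

def apply_sim3(points: np.ndarray, s, R, t) -> np.ndarray:
    """Map an (N, 3) point cloud through the fitted similarity transform."""
    return s * points @ R.T + t
```

Fitting on the estimated vs. GT camera centers and then pushing the point cloud through `apply_sim3` brings the reconstruction into the GT frame at metric scale, up to the model's own estimation error.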

Zhaoyibinn avatar Jun 13 '25 07:06 Zhaoyibinn

> I just used a similar approach to Dust3R to register the camera pose output by the model with the actual camera pose, and applied rigid transformations to the camera and point cloud. I found that when the camera intrinsics are accurately estimated I can achieve good alignment, but when the estimated intrinsics (focal) are inaccurate, there is some deviation in the point cloud alignment. My modifications are: https://github.com/Zhaoyibinn/vggt
>
> Do you have any suggestions? @jytime

May I ask how to use your code with the known camera intrinsic and extrinsic parameters? Is there a README file? Thank you.

missTL avatar Jun 13 '25 12:06 missTL

Oh, this was my question. Thank you very much for your attention. I have committed the code with a README and the two corresponding COLMAP DTU datasets for your reference: Zhaoyibin_VGGT_fork

Zhaoyibinn avatar Jun 15 '25 02:06 Zhaoyibinn

> Hi @booker-max, Your proposed solutions look great. I would prefer the second option as it does not need to finetune a model. For the first one, you may need to finetune the model with camera information as condition.
>
> Thank you, I'll go try it.

Hello, we are also exploring how to use VGGT to obtain real-scale point cloud information of scenes. Is your method currently feasible? We would love to exchange insights and collaborate. Looking forward to your reply.

Lucky-zi-lin avatar Sep 24 '25 07:09 Lucky-zi-lin

> Hello, we are also exploring how to use VGGT to obtain real-scale point cloud information of scenes. Is your method currently feasible? We would love to exchange insights and collaborate. Looking forward to your reply.

Hi, hope this helps. The idea is to use a reference point at a known distance. In our case, we measured the distance from the camera to the floor. You can find it in our work, in the Running VGGT >> Compute scale factor section. Alternatively, they have released a newer model with real-scale estimation.
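The reference-distance idea above boils down to a one-measurement scale fix; a minimal sketch, with illustrative names (this is not the code from the linked work):

```python
import numpy as np

def scale_from_reference(measured_dist_m: float,
                         cam_center: np.ndarray,
                         ref_point: np.ndarray) -> float:
    """measured_dist_m: tape-measured camera-to-reference distance in meters.
    cam_center, ref_point: the same two points in the model's unscaled frame."""
    est_dist = float(np.linalg.norm(ref_point - cam_center))
    return measured_dist_m / est_dist

# metric_points = scale * unscaled_points ; metric_depth = scale * unscaled_depth
```

One accurate real-world measurement fixes the global scale; using several reference points and averaging the resulting factors would reduce sensitivity to the model's per-point error.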

yyypsycheguy avatar Sep 24 '25 07:09 yyypsycheguy

> Oh, this is my question. Thank you very much for your attention. I have already committed the code with README and the corresponding two Colmap DTU datasets for your reference. Zhaoyibin_VGGT_fork

Hi, may I ask a question about the camera intrinsics? You mentioned that when the camera intrinsics are accurately estimated, good alignment can be achieved. Are the intrinsics used for alignment the ones estimated by VGGT?

Msyu1020 avatar Oct 21 '25 13:10 Msyu1020

> Oh, this is my question. Thank you very much for your attention. I have already committed the code with README and the corresponding two Colmap DTU datasets for your reference. Zhaoyibin_VGGT_fork

> Hi, may I ask a question about the camera intrinsics? As you mentioned that when the camera's internal parameters are accurately estimated, it can achieve good alignment. I wonder if the intrinsics used for alignment are estimated by VGGT?

I have tried both the GT intrinsics and the estimated intrinsics, but after aligning the camera poses, the point clouds still cannot be completely aligned.

Zhaoyibinn avatar Oct 22 '25 08:10 Zhaoyibinn

> I just used a similar approach to Dust3R to register the camera pose output by the model with the actual camera pose. And rigid transformations were applied to the camera and point cloud. But I found that when the camera's internal parameters are accurately estimated, I can achieve good alignment, but when the estimated internal parameters (focal) are inaccurate, there will be some deviation in point cloud alignment. My modifications are: https://github.com/Zhaoyibinn/vggt
>
> Do you have any suggestions? @jytime

Hello, this code repository only shows modifications based on the KITTI dataset; I did not see the implementation of registering the model's output camera poses to the actual camera poses and applying rigid transformations to the cameras and point cloud. Have I overlooked something?

Shexiaox avatar Oct 28 '25 09:10 Shexiaox