Evaluating the pose on ScanNet

Open kk6398 opened this issue 7 months ago • 8 comments

Similar to issues #37 and #70, I would like to evaluate camera poses on the ScanNet dataset. My current test results are as follows:

I evaluated the pose on scene0011_00 with different intervals and frame counts (interval and num denote the sampling interval and the number of frames, respectively):

[two result tables attached as images]

Are these test results normal? I also found that between "align=True" and "align=False", the ATE and ARE differ greatly. What is the main reason for this?

I sincerely hope to get your reply!

Best Wishes

kk6398 avatar May 18 '25 08:05 kk6398

Hi,

I am not sure which codebase you are using, but align=True/False sounds like a switch for whether alignment is applied to the predicted poses. Our predicted camera poses live in a normalized unit space, while the ScanNet ground truth may be in metric scale, so if you do not align them the results will look quite strange.

At the same time, please double-check that you are using the OpenCV camera-from-world convention.
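For reference, "alignment" here typically means estimating a similarity transform (scale, rotation, translation) between the two trajectories of camera centers, e.g. with the Umeyama method that evo uses internally. The sketch below is not evo's code but a minimal numpy version of that standard method; the trajectory and transform are synthetic, for illustration only:

```python
import numpy as np

def umeyama_alignment(src, dst, with_scale=True):
    """Umeyama similarity alignment: find s, R, t minimizing
    ||dst - (s * R @ src + t)||.  src, dst: (N, 3) camera centers."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)          # cross-covariance dst<-src
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                          # avoid a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(axis=0).sum() if with_scale else 1.0
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Toy check: a "prediction" that is a scaled, rotated, shifted copy of GT.
rng = np.random.default_rng(0)
gt = rng.normal(size=(20, 3))                 # metric-scale GT camera centers
angle = 0.3
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
pred = (gt @ Rz.T) / 4.0 + 1.5                # normalized-unit-space prediction

s, R, t = umeyama_alignment(pred, gt)
aligned = (s * (R @ pred.T)).T + t
ate_before = np.sqrt(((pred - gt) ** 2).sum(axis=1)).mean()
ate_after = np.sqrt(((aligned - gt) ** 2).sum(axis=1)).mean()  # ~0: exact fit
```

Without alignment, the ATE mixes the arbitrary scale and frame of the prediction with the metric GT frame, which is why align=True and align=False give such different numbers.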

jytime avatar May 18 '25 20:05 jytime

Hi,

I am not sure which codebase you are using, but align=True/False sounds like a switch for whether alignment is applied to the predicted poses. Our predicted camera poses live in a normalized unit space, while the ScanNet ground truth may be in metric scale, so if you do not align them the results will look quite strange.

At the same time, please double-check that you are using the OpenCV camera-from-world convention.

The author of #37 provided the code https://github.com/VladimirYugay/vggt_inference/blob/main/vggt_eval_scannet.py, which evaluates poses on video. I modified this code to evaluate consecutive image frames; the detailed code is here: pose_eval

I have tried to align the camera-to-world poses as best I can, but there still seem to be some errors. I also checked the point-cloud results, and they are accurate. Could you help me figure out where the problem is? Thanks a lot!

kk6398 avatar May 19 '25 01:05 kk6398

@kk6398 I am facing a similar problem. Did you solve it?

engrmusawarali71 avatar May 19 '25 11:05 engrmusawarali71

@kk6398 I am facing a similar problem. Did you solve it?

It has not been solved yet. What methods and metrics do you use for evaluation, and what are your results? We can discuss it further.

kk6398 avatar May 19 '25 11:05 kk6398

@kk6398 Hello, I am trying to evaluate the camera poses estimated by VGGT against camera poses obtained with a ChArUco board. I tried to align the estimated trajectory with the ground-truth trajectory using the Umeyama method, but the scaled trajectory always becomes larger. I cannot figure out the issue.

engrmusawarali71 avatar May 19 '25 11:05 engrmusawarali71

@kk6398 Hello, I am trying to evaluate the camera poses estimated by VGGT against camera poses obtained with a ChArUco board. I tried to align the estimated trajectory with the ground-truth trajectory using the Umeyama method, but the scaled trajectory always becomes larger. I cannot figure out the issue.

Sorry, I still don't have a good solution and am still exploring. The author suggested accounting for the scale between VGGT's predicted poses and the GT poses, as you can see in #64. On the other hand, since you mentioned the Umeyama method, what results do you get with 10 or 20 input frames?
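One possible cause worth checking for the "trajectory always becomes larger" symptom (an assumption on my part, not a confirmed diagnosis): the Umeyama scale depends on which trajectory is treated as the source. If the similarity is estimated in the GT-to-prediction direction but then applied to the prediction, the scale is roughly the reciprocal of the correct one. A toy numpy sketch, with the rotation part omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
gt = rng.normal(size=(30, 3)) * 2.0   # metric-scale GT camera centers (synthetic)
pred = gt * 0.25                      # prediction in a normalized unit space

def lsq_scale(src, dst):
    """Least-squares scale s for dst ~= s * src after centering
    (rotation omitted for brevity; the full Umeyama method adds it)."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    return (dst_c * src_c).sum() / (src_c ** 2).sum()

s_pred_to_gt = lsq_scale(pred, gt)    # correct direction: scales prediction up to GT
s_gt_to_pred = lsq_scale(gt, pred)    # flipped direction: roughly the reciprocal
```

If the aligned trajectory consistently comes out too large, it is worth checking which direction the similarity was estimated in and which trajectory the scale is applied to.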

If you have any progress, please share it with me, and I hope we can help each other. Thank you!

kk6398 avatar May 19 '25 13:05 kk6398

@jytime

Hi,

I am not sure which codebase you are using, but align=True/False sounds like a switch for whether alignment is applied to the predicted poses. Our predicted camera poses live in a normalized unit space, while the ScanNet ground truth may be in metric scale, so if you do not align them the results will look quite strange.

At the same time, please double-check that you are using the OpenCV camera-from-world convention.

I calculated that the scene scale on ScanNet is close to 1 (e.g. 1.012 or 0.997). I then multiplied the translation part of the GT pose by this scale to obtain the scaled pose: `extrinsics_gt[:, :, :3, 3] = extrinsics_gt[:, :, :3, 3] * avg_scale.view(-1, 1, 1)`. Since the scale is close to 1, the results are not significantly different. May I ask if my approach is correct?
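As a sanity check, that torch line can be mirrored in numpy; the shapes B, S and all values below are made up for illustration only:

```python
import numpy as np

# Hypothetical batch of B sequences with S world-to-camera GT extrinsics each;
# shapes and values are invented to mirror the torch snippet quoted above.
B, S = 1, 5
extrinsics_gt = np.tile(np.eye(4), (B, S, 1, 1))
extrinsics_gt[:, :, :3, 3] = np.arange(B * S * 3, dtype=float).reshape(B, S, 3)

# One scale factor per sequence relating predicted (normalized) to GT scale.
avg_scale = np.array([1.012])

# numpy equivalent of:
#   extrinsics_gt[:, :, :3, 3] = extrinsics_gt[:, :, :3, 3] * avg_scale.view(-1, 1, 1)
extrinsics_gt[:, :, :3, 3] *= avg_scale.reshape(-1, 1, 1)
```

Note that for world-to-camera extrinsics the translation is t = -R C, so scaling t uniformly scales the camera centers C, which is what a global scene rescale should do; the rotation block must stay untouched, as it does here.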

kk6398 avatar May 20 '25 03:05 kk6398

It seems the API used in https://github.com/VladimirYugay/vggt_inference/blob/main/vggt_eval_scannet.py has already considered this. It comes from https://github.com/MichaelGrupp/evo/blob/86f52ade6da8cc4749c6170b1d2771ea1e0f1c66/evo/main_ape.py#L42:

def ape(traj_ref: PosePath3D, traj_est: PosePath3D,
        pose_relation: metrics.PoseRelation, align: bool = False,
        correct_scale: bool = False, n_to_align: int = -1,
        align_origin: bool = False, ref_name: str = "reference",
        est_name: str = "estimate",
        change_unit: typing.Optional[metrics.Unit] = None,
        project_to_plane: typing.Optional[Plane] = None) -> Result:

My wild guess (I have not really checked evo's code) is that you just need to ensure align=True, correct_scale=True, align_origin=True.

The result should then be correct.

jytime avatar May 20 '25 03:05 jytime