Using VGGT to recover the camera trajectory in the world coordinate system
I want to use VGGT to recover the camera trajectory in the world coordinate system. There are also people moving in the camera's field of view, and the direct estimation results are poor. How can this problem be solved?
Is this what you are looking for?
https://github.com/facebookresearch/vggt/issues/47
I used the method you mentioned, but the camera trajectory is still not accurate. You can check my code below. @jytime
```python
def save_trajectory_as_obj(points, filename):
    """Write the camera positions as OBJ vertices connected by line segments."""
    with open(filename, 'w') as f:
        for p in points:
            f.write(f"v {p[0]:.6f} {p[1]:.6f} {p[2]:.6f}\n")
        if len(points) >= 2:
            for i in range(len(points) - 1):
                f.write(f"l {i + 1} {i + 2}\n")
```
```python
import os

import torch

from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images
from vggt.utils.pose_enc import pose_encoding_to_extri_intri

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16

model = VGGT.from_pretrained("/data").to(device)
image_names = [os.path.join(image_folder, img)
               for img in sorted(os.listdir(image_folder))][::20]  # an interval of 20 frames
images = load_and_preprocess_images(image_names).to(device)

with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        images = images[None]  # add batch dimension
        print(images.shape)
        aggregated_tokens_list, ps_idx = model.aggregator(images)
    pose_enc = model.camera_head(aggregated_tokens_list)[-1]
    extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])

torch.save(extrinsic, "extrinsic.pt")
traj = torch.load("extrinsic.pt")[0, :, :, 3]  # translation column of each extrinsic
save_trajectory_as_obj(traj, 'extrinsic.obj')
```
ground truth
vggt pred
Hey, I cannot see what happens there without access to the original images.
@jytime Thank you very much!
Hey, thanks for sharing. It looks like the images were uploaded by stitching low-resolution frames, so I can’t run them directly.
Here are the most plausible issues I can think of:
1. The visualization script you’re using might not be compatible with the OpenCV camera_from_world convention, which is the coordinate system our predicted cameras follow. If that’s the case, the rendered results will look distorted or messy—just like in your output.
2. The model might be getting confused by the visual similarity between buildings. You can try running it with just 5 or 10 continuous frames to see if the visualization becomes more consistent.
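On point 1, one concrete pitfall worth checking: in the OpenCV camera_from_world convention, each extrinsic `[R | t]` maps world points into camera coordinates, so the translation column `t` is not the camera position. The camera center in world coordinates is `C = -Rᵀ t`. A minimal NumPy sketch of that conversion (the function name is mine, not part of VGGT), which could replace taking `extrinsic[0, :, :, 3]` directly:

```python
import numpy as np

def camera_centers_from_extrinsics(extrinsics):
    """Convert camera_from_world extrinsics to camera centers in world coordinates.

    extrinsics: (S, 3, 4) array where each [R | t] maps world -> camera,
    i.e. x_cam = R @ x_world + t. Setting x_cam = 0 gives C = -R^T t.
    """
    R = extrinsics[:, :3, :3]  # (S, 3, 3)
    t = extrinsics[:, :3, 3]   # (S, 3)
    # Per-frame C = -R^T t
    return -np.einsum('sij,si->sj', R, t)

# Example: identity rotation with t = (1, 2, 3) places the camera
# at world position (-1, -2, -3).
E = np.zeros((1, 3, 4))
E[0, :3, :3] = np.eye(3)
E[0, :, 3] = [1.0, 2.0, 3.0]
print(camera_centers_from_extrinsics(E))  # [[-1. -2. -3.]]
```

If the trajectory is plotted from the raw translation columns instead of these centers, a smooth physical path can easily look folded or piled up near the origin.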
The reconstruction looks very strange. In the original video, a person walks straight forward, then turns left, and walks straight again, but in the reconstruction everything looks piled up at the origin. @jytime
The camera positions also look very unusual. @jytime