Using VGGT to recover the camera trajectory in the world coordinate system
I want to use VGGT to recover the camera trajectory in the world coordinate system. There are also people moving in the camera's field of view, and the direct estimation results are poor. How can this problem be solved?
Is this what you are looking for?
https://github.com/facebookresearch/vggt/issues/47
I used the method you mentioned, but the camera trajectory is still not accurate. You can check my code below. @jytime
```python
def save_trajectory_as_obj(points, filename):
    """Write the camera positions as OBJ vertices connected by line segments."""
    with open(filename, 'w') as f:
        for p in points:
            f.write(f"v {p[0]:.6f} {p[1]:.6f} {p[2]:.6f}\n")
        if len(points) >= 2:
            for i in range(len(points) - 1):
                f.write(f"l {i + 1} {i + 2}\n")
```
```python
import os

import torch

from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images
from vggt.utils.pose_enc import pose_encoding_to_extri_intri

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16

model = VGGT.from_pretrained("/data").to(device)
image_names = [os.path.join(image_folder, img)
               for img in sorted(os.listdir(image_folder))][::20]  # an interval of 20 frames
images = load_and_preprocess_images(image_names).to(device)

with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        images = images[None]  # add batch dimension
        print(images.shape)
        aggregated_tokens_list, ps_idx = model.aggregator(images)
    pose_enc = model.camera_head(aggregated_tokens_list)[-1]
    extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])

torch.save(extrinsic, "extrinsic.pt")
traj = torch.load("extrinsic.pt")[0, :, :, 3]  # translation column of each extrinsic
save_trajectory_as_obj(traj, 'extrinsic.obj')
```
ground truth
vggt pred
Hey, I cannot see what happens there without access to the original images.
@jytime Thank you very much!
Hey, thanks for sharing. It looks like the images were uploaded by stitching low-resolution frames, so I can’t run them directly.
Here are the most plausible issues I can think of:
1. The visualization script you’re using might not be compatible with the OpenCV camera_from_world convention, which is the coordinate system our predicted cameras follow. If that’s the case, the rendered results will look distorted or messy—just like in your output.
2. The model might be getting confused by the visual similarity between buildings. You can try running it with just 5 or 10 continuous frames to see if the visualization becomes more consistent.
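On point 1, one concrete pitfall worth checking: in the OpenCV camera_from_world convention, each extrinsic `[R | t]` maps world points into camera coordinates, so the translation column `t` is not the camera position. The camera center in world coordinates is `C = -Rᵀ t`. A minimal NumPy sketch of that conversion (the function name is mine, not part of VGGT), which could replace taking `extrinsic[0, :, :, 3]` directly:

```python
import numpy as np

def camera_centers_from_extrinsics(extrinsics):
    """Convert camera_from_world extrinsics to camera centers in world coordinates.

    extrinsics: (S, 3, 4) array where each [R | t] maps world -> camera,
    i.e. x_cam = R @ x_world + t. Setting x_cam = 0 gives C = -R^T t.
    """
    R = extrinsics[:, :3, :3]  # (S, 3, 3)
    t = extrinsics[:, :3, 3]   # (S, 3)
    # Per-frame C = -R^T t
    return -np.einsum('sij,si->sj', R, t)

# Example: identity rotation with t = (1, 2, 3) places the camera
# at world position (-1, -2, -3).
E = np.zeros((1, 3, 4))
E[0, :3, :3] = np.eye(3)
E[0, :, 3] = [1.0, 2.0, 3.0]
print(camera_centers_from_extrinsics(E))  # [[-1. -2. -3.]]
```

If the trajectory is plotted from the raw translation columns instead of these centers, a smooth physical path can easily look folded or piled up near the origin.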
The reconstruction looks very strange. In the original video, a person walks straight forward, then turns left, and walks straight again, but in the reconstruction everything looks piled up at the origin. @jytime
The camera positions also look very unusual. @jytime