
Using VGGT to recover the camera trajectory in the world coordinate system

rocket-ycyer opened this issue 7 months ago • 7 comments

I want to use VGGT to recover the camera trajectory in the world coordinate system. There are also people moving in the camera's field of view, and the results from running the model directly are not good. How can this problem be solved?

rocket-ycyer · May 12 '25

Is this what you are looking for?

https://github.com/facebookresearch/vggt/issues/47

jytime · May 12 '25

I used the method you mentioned, but the camera trajectory is still not accurate. You can check my code. @jytime

```python
import os
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images
from vggt.utils.pose_enc import pose_encoding_to_extri_intri

def save_trajectory_as_obj(points, filename):
    # Write the trajectory as OBJ vertices connected by line segments.
    with open(filename, 'w') as f:
        for p in points:
            f.write(f"v {p[0]:.6f} {p[1]:.6f} {p[2]:.6f}\n")
        if len(points) >= 2:
            for i in range(len(points) - 1):
                f.write(f"l {i + 1} {i + 2}\n")

device = "cuda"
dtype = torch.float16  # assumed setup; bfloat16 on Ampere or newer GPUs
model = VGGT.from_pretrained("/data").to(device)

image_names = [os.path.join(image_folder, img)
               for img in sorted(os.listdir(image_folder))][::20]  # an interval of 20 frames
images = load_and_preprocess_images(image_names).to(device)

with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        images = images[None]  # add batch dimension
        print(images.shape)
        aggregated_tokens_list, ps_idx = model.aggregator(images)
        pose_enc = model.camera_head(aggregated_tokens_list)[-1]
        extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])

torch.save(extrinsic, "extrinsic.pt")
traj = torch.load("extrinsic.pt")[0, :, :, 3]  # translation column of each 3x4 extrinsic
save_trajectory_as_obj(traj, 'extrinsic.obj')
```
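Note that `traj = extrinsic[0, :, :, 3]` plots the raw translation column. If the predicted extrinsics are camera_from_world in the OpenCV convention (see the reply further down), the camera center in world coordinates is C = -Rᵀt, not t. A minimal conversion sketch under that assumption:

```python
import torch

# Sketch: convert camera_from_world extrinsics to camera centers in world
# coordinates. Assumes x_cam = R @ x_world + t, so the center is C = -R^T @ t.
extrinsic = torch.load("extrinsic.pt")[0]       # [S, 3, 4]
R = extrinsic[:, :, :3]                         # [S, 3, 3] world-to-camera rotation
t = extrinsic[:, :, 3:]                         # [S, 3, 1] world-to-camera translation
centers = (-R.transpose(1, 2) @ t).squeeze(-1)  # [S, 3] camera centers in world frame
save_trajectory_as_obj(centers.cpu().numpy(), "trajectory_world.obj")
```

If the converted trajectory matches the ground-truth shape, the original plot was showing translation vectors rather than camera positions.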

[Image]

[Image: ground truth]

[Image: VGGT prediction]

rocket-ycyer · May 13 '25

Hey, I cannot see what happens there without access to the original images.

jytime · May 13 '25

@jytime Thank you very much!

[Images: original video frames]

rocket-ycyer · May 13 '25

Hey, thanks for sharing. It looks like the images were uploaded as stitched low-resolution frames, so I can’t run them directly.

Here are the most plausible issues I can think of:

1. The visualization script you’re using might not be compatible with the OpenCV camera_from_world convention, which is the coordinate system our predicted cameras follow. If that’s the case, the rendered results will look distorted or messy, just like in your output.
2. The model might be getting confused by the visual similarity between buildings. You can try running it with just 5 or 10 consecutive frames (see the sketch after this list) to see if the visualization becomes more consistent.
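For point 2, a quick test that reuses the setup from the snippet above (the first 10 frames here are just an example):

```python
# Sketch: run on a short consecutive clip instead of sampling every 20th frame,
# to check whether visually similar buildings are confusing the model.
image_names = [os.path.join(image_folder, img)
               for img in sorted(os.listdir(image_folder))][:10]
images = load_and_preprocess_images(image_names).to(device)[None]
with torch.no_grad(), torch.cuda.amp.autocast(dtype=dtype):
    aggregated_tokens_list, ps_idx = model.aggregator(images)
    pose_enc = model.camera_head(aggregated_tokens_list)[-1]
    extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])
```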

jytime · May 13 '25

The reconstruction looks very strange. In the original video, a person walks straight forward, then turns left, and walks straight again, but in the reconstruction everything looks piled up at the origin. @jytime

[Image]

[Image]

rocket-ycyer · May 14 '25

The camera's position looks very unusual. @jytime

[Image]

rocket-ycyer · May 14 '25