Unexpected Tracker Output
I’m using the VGGT track_head to predict 2D point correspondences across a pair of images. When I input 2 images and a single query point, I expect the tracker to return coordinates for that point across 2 frames. However, the output contains coordinates for 2 points across 4 frames, which doesn’t match my input or the expected behavior.
Input: tensor([[349.3802, 86.7870]])
Output: [tensor([[[[349.9198, 85.7759]], [[190.0406, 93.3672]]]]), tensor([[[[349.9198, 85.7759]], [[184.6947, 88.7198]]]]), tensor([[[[349.9198, 85.7759]], [[185.2204, 88.8006]]]]), tensor([[[[349.9198, 85.7759]],[[184.7553, 88.8715]]]])]
Hi, could you provide a full code snippet to reproduce this behavior?
I suspect this comes from a dimension mismatch. The expected input shapes are documented here:
https://github.com/facebookresearch/vggt/blob/6d361a374ea50b040e93fa68fca0ab2cbee0e7a8/vggt/models/vggt.py#L27-L55
Hi @jytime, thanks for your quick response! Below is a minimal example that takes 2 synthetic images and 1 query point:
import sys
import cv2
import numpy as np
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images
from vggt.utils.pose_enc import pose_encoding_to_extri_intri
from vggt.utils.geometry import unproject_depth_map_to_point_map
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
model = VGGT()
_URL = "https://huggingface.co/facebook/VGGT-1B/resolve/main/model.pt"
model.load_state_dict(torch.hub.load_state_dict_from_url(_URL))
image_names = [sys.argv[1], sys.argv[2]]
images = load_and_preprocess_images(image_names)
with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        images = images[None]
        aggregated_tokens_list, ps_idx = model.aggregator(images)
    query_points = torch.FloatTensor([[100.0, 200.0]])
    track_list, vis_score, conf_score = model.track_head(aggregated_tokens_list, images, ps_idx, query_points=query_points[None])
print(track_list)
Output:
[tensor([[[[100.0000, 200.0000]],
[[ 21.4744, 206.2926]]]]), tensor([[[[100.0000, 200.0000]],
[[ 19.0711, 188.1253]]]]), tensor([[[[100.0000, 200.0000]],
[[ 24.1671, 190.6173]]]]), tensor([[[[100.0000, 200.0000]],
[[ 22.7686, 194.9734]]]])]
You get a list of length 4 because the tracker runs 4 iterations and returns the prediction from each one. The last entry should be the best one. It is a tensor of shape (1, 2, 1, 2), i.e. (batch_size, n_images, n_query_points, 2), as expected.
You can see iters being set to 4 here: https://github.com/facebookresearch/vggt/blob/588a0a238db3f23f51440ade013e5a4b2c8de6be/vggt/heads/track_head.py#L12-L30
You can see the list being built and returned here: https://github.com/facebookresearch/vggt/blob/588a0a238db3f23f51440ade013e5a4b2c8de6be/vggt/heads/track_modules/base_track_predictor.py#L188-L209
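To make the indexing concrete, here is a minimal sketch of how to pull the final prediction out of the returned list. It does not call the model: track_list is rebuilt by hand from the values printed in the repro above, and the variable names (final_tracks, per_frame, point_in_frame0, point_in_frame1) are only illustrative.

```python
import torch

# Stand-in for the track_list returned by model.track_head, rebuilt from the
# values printed earlier in this thread (one tensor per tracker iteration).
track_list = [
    torch.tensor([[[[100.0, 200.0]], [[21.4744, 206.2926]]]]),
    torch.tensor([[[[100.0, 200.0]], [[19.0711, 188.1253]]]]),
    torch.tensor([[[[100.0, 200.0]], [[24.1671, 190.6173]]]]),
    torch.tensor([[[[100.0, 200.0]], [[22.7686, 194.9734]]]]),
]

# The list has one entry per iteration; keep only the last one.
final_tracks = track_list[-1]  # (batch_size, n_images, n_query_points, 2)

# Drop the batch dimension, then index by frame and by query point.
per_frame = final_tracks[0]        # (n_images, n_query_points, 2)
point_in_frame0 = per_frame[0, 0]  # the query point in the first image
point_in_frame1 = per_frame[1, 0]  # its predicted match in the second image

print(len(track_list), tuple(final_tracks.shape))
print(point_in_frame0.tolist(), point_in_frame1.tolist())
```

With a real model call, the same track_list[-1] indexing applies. The earlier entries are the intermediate predictions and can be ignored unless you want to inspect how the estimate changes across iterations.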