Expected behavior on portrait images?
Hi, I was experimenting with integrating the VGGT code into my custom training pipeline, and I noticed some strange behavior when running the model on portrait images.
Specifically, I passed the same portrait image through the aggregator and then into either the depth head or the point head. I followed the sample code to obtain the intrinsics matrix via `extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])`. After unprojecting the depth map to a point cloud, I noticed that its aspect ratio looks incorrect, and it does not line up with the point cloud obtained from the point head (red and yellow points, respectively).
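For reference, here is a minimal NumPy sketch of how I do the unprojection once I have the extrinsic/intrinsic. This is my own code, not from the repo, so the convention assumptions in it (z-depth, OpenCV-style world-to-camera extrinsic) may well be where my bug is:

```python
import numpy as np

def unproject_depth(depth, intrinsic, extrinsic):
    """Unproject a single (H, W) depth map to world-space points.

    intrinsic: 3x3 camera matrix; extrinsic: 3x4 world-to-camera [R | t]
    (my assumption about the convention returned by pose_encoding_to_extri_intri).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                     # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW homogeneous pixels
    rays = np.linalg.inv(intrinsic) @ pix                              # camera-space rays
    pts_cam = rays * depth.reshape(1, -1)                              # scale rays by z-depth
    R, t = extrinsic[:, :3], extrinsic[:, 3:]                          # world -> camera
    pts_world = R.T @ (pts_cam - t)                                    # back to world frame
    return pts_world.T.reshape(H, W, 3)
```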
If I transpose the image so that it is in landscape before passing it into the model, then the output point clouds from the two heads do line up correctly.
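Concretely, the workaround looks roughly like this (a sketch; `run_model` is just a placeholder for my aggregator + depth-head call, not the actual VGGT API):

```python
import torch

def run_in_landscape(run_model, images):
    """Sketch of the workaround: rotate portrait inputs to landscape, run the
    model, then rotate the dense per-pixel outputs back. `run_model` is a
    placeholder for my own aggregator + depth-head wrapper.
    """
    H, W = images.shape[-2:]
    portrait = H > W
    if portrait:
        images = torch.rot90(images, k=1, dims=(-2, -1))  # portrait -> landscape

    depth = run_model(images)  # per-pixel depth, shape (..., H', W')

    if portrait:
        # depth is a per-pixel quantity, so rotating the grid back is enough;
        # for the point map the predicted camera frame rotates as well, so the
        # 3D values themselves would also need a corresponding rotation.
        depth = torch.rot90(depth, k=-1, dims=(-2, -1))
    return depth
```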
Moreover, even though the point cloud from the point head in the first case (portrait) looks plausible, it does not align very well with the ground truth point cloud I have, while both point clouds from the second case (transposed to landscape) align with the ground truth much better.
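For context, I was judging alignment with a simple mean nearest-neighbour distance between the predicted and ground-truth clouds (assuming both are already in the same world frame):

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_nn_distance(pred_pts, gt_pts):
    """Mean distance from each predicted point to its nearest ground-truth
    point; both arrays are (N, 3) in the same world frame."""
    dists, _ = cKDTree(gt_pts).query(pred_pts, k=1)
    return float(dists.mean())
```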
I'm wondering whether this is the expected behavior and I should ensure all input images are in landscape, or whether it might be caused by some other bug in my code. Thanks!
I ran into a similar problem. It seems that during training the dataloader forces images to be in landscape (https://github.com/facebookresearch/vggt/blob/training/training/data/datasets/co3d.py). Is this true for all datasets?
Maybe it's an intrinsics error from the camera head.
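A quick way to test that could be a sanity check of the predicted intrinsics against the portrait resolution, something like this (a sketch; `intrinsic` is the 3x3 matrix for one view from `pose_encoding_to_extri_intri`):

```python
import numpy as np

def check_intrinsics(intrinsic: np.ndarray, H: int, W: int) -> None:
    """Rough sanity check of a 3x3 intrinsics matrix against the input size:
    the principal point should sit near the image centre, and fx/fy should not
    look swapped relative to W/H."""
    fx, fy = intrinsic[0, 0], intrinsic[1, 1]
    cx, cy = intrinsic[0, 2], intrinsic[1, 2]
    print(f"image H x W = {H} x {W}")
    print(f"fx={fx:.1f}, fy={fy:.1f}, cx={cx:.1f} (expect ~{W / 2}), cy={cy:.1f} (expect ~{H / 2})")
    if abs(cx - W / 2) > 0.25 * W or abs(cy - H / 2) > 0.25 * H:
        print("principal point far from the centre -> intrinsics may assume swapped H/W")
```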