
input image shape height > width

Open dreamer1031 opened this issue 9 months ago • 12 comments

The article recommends that the input image be square. When the input image has width > height, padding is done; when height > width, resize + crop is done. If all the images in my usage scenario have height > width, and I want to preserve the original image as completely as possible so that I can find the correspondence between 2D pixels and 3D points, what is the recommended way to process the input images?

  1. Rotate the picture so that width > height.
  2. Center-crop the picture so that its shape is square.
  3. Crop the picture (720x1280) into 2 pictures (720x720) with some overlap.

I tried all three ways and found that the results rank Method 2 > Method 1 >> Method 3.

dreamer1031 avatar Mar 19 '25 06:03 dreamer1031

To add, there are a number of issues with image sizes:

  1. The paper shows most examples with different ratios, and the training suggests randomized ratios between 0.33 and 1.0, so I'm not sure why a 1x1 ratio would be best.
  2. I tested the code at higher resolution (single image), doubling it from W=518 x H=294 to 1036x588 (as long as the dimensions are multiples of 14 it should work, and it did). The results are good, except that the focal estimates changed from fx=fy to fx=2.1*fy (which is odd, since I expected fx=fy but at double the scale). Everything looked good otherwise.

I hope the authors can clarify the resolution issues. The results appear to be a clear improvement over dust3r, so the community will benefit.

yaseryacoob avatar Mar 19 '25 12:03 yaseryacoob

Hi @dreamer1031,

Thanks for your interest! Could you share where the recommendation to use square-sized images appears? I don't think I stated that, but I might have written it unintentionally somewhere, and I'd like to correct it. Our model was trained with random aspect ratios ranging from 0.33 to 1, so it is fine with different aspect ratios.

As shown below, in practice you only need to resize the input width to 518, and the height can remain flexible. This has worked well in my tests; please let me know if you encounter any issues.

https://github.com/facebookresearch/vggt/blob/fa5911cb6be0b979b3507564bccdb95144178b0f/vggt/utils/load_fn.py#L12
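
For reference, a minimal sketch of that preprocessing (not the exact code in load_fn.py, just the idea: fix the width at 518 and round the height to a multiple of 14; PIL and torchvision are assumptions here):

from PIL import Image
from torchvision import transforms as TF

def resize_width_to_518(image_path, target_width=518):
    # Resize so the width becomes 518 and the height keeps the aspect ratio,
    # rounded to the nearest multiple of 14 (the ViT patch size).
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    new_h = max(14, round(h * target_width / w / 14) * 14)
    img = img.resize((target_width, new_h), Image.Resampling.BICUBIC)
    return TF.ToTensor()(img)  # (3, new_h, 518), values in [0, 1]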

For scenarios where the image height > width and you aim to preserve the original image integrity (to accurately map pix-2d to point-3d), I recommend:

  1. Resize the image proportionally and pad the width to reach 518. Center cropping isn't necessary. This method should work best if a slightly lower-resolution output is acceptable (a sketch appears at the end of this comment).

  2. Rotate the images by 90 degrees, as you've tried. Generally, this should not significantly degrade performance (though corner cases might exist). Just ensure consistency by rotating all images uniformly; mixing orientations (e.g., some rotated left and others right) can substantially hurt performance.

Feel free to experiment with these methods. To be honest, I haven't thoroughly tested various resizing or padding options—my suggestions are based on intuition and initial validations. Your experiments indeed suggest certain settings might yield even better performance.
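
A rough sketch of one possible reading of option 1 (for portrait images: fit the height to 518, then pad the width symmetrically up to 518; black padding and PIL/torchvision are assumptions, and the returned offset is what you would use to map 2D pixels back to the original image):

from PIL import Image
from torchvision import transforms as TF

def resize_and_pad_to_518(image_path, target=518):
    # For portrait images (height > width): fit the height to 518,
    # then pad the width symmetrically so the result is 518 x 518.
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    new_w = max(14, round(w * target / h / 14) * 14)
    img = img.resize((new_w, target), Image.Resampling.BICUBIC)

    pad_left = (target - new_w) // 2
    canvas = Image.new("RGB", (target, target), color=(0, 0, 0))
    canvas.paste(img, (pad_left, 0))
    # Return the tensor plus the horizontal offset of the original content.
    return TF.ToTensor()(canvas), pad_left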

jytime avatar Mar 19 '25 19:03 jytime

Hi @yaseryacoob ,

As discussed above, the recommendation of a 1x1 aspect ratio isn't necessarily a general rule and might be optimal only in specific scenarios. From my personal experiments, I've found that an aspect ratio of around 0.75 usually provides the best qualitative results, though I haven't confirmed this quantitatively.

Regarding your observation with the resolution change (from 518x294 to 1036x588), my initial thought is that this might relate to the field of view (FOV). Our model regresses FOV rather than focal length directly, which could explain why the estimated focal lengths (fx and fy) differ upon resizing. The model might interpret the field of view slightly differently when the input resolution changes.
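
For intuition, under a simple pinhole model the pixel focal length follows from the FOV and the image size, so the same predicted FOV gives a different focal in pixels at a different resolution (a rough sketch, not the repo's exact conversion):

import math

def focal_px_from_fov(fov_rad, size_px):
    # Pinhole model: f = (size / 2) / tan(fov / 2)
    return 0.5 * size_px / math.tan(0.5 * fov_rad)

# The same horizontal FOV at width 518 vs. 1036 yields a focal twice as large in pixels,
# so a ratio other than 2 suggests the predicted FOV itself changed with resolution.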

I'm glad to hear that the model performs well at the higher resolution (1036x588) in your tests. Please feel free to share further observations or feedback—it's very helpful for us and the community!

Thanks again!

jytime avatar Mar 19 '25 19:03 jytime

Thank you very much for your reply! First of all, I checked and found that neither the article nor the code says that square images are best; I misunderstood, sorry. I think the assumption that "the camera's principal point is at the image center" explains the bad results of "crop the picture (720x1280) into 2 pictures (720x720) with some overlap". So I tried writing a new load_images function that resizes the image proportionally (height to 518) and pads the width:

import torch
from PIL import Image
from torchvision import transforms as TF


def load_and_preprocess_images_new(image_path_list):
    if len(image_path_list) == 0:
        raise ValueError("At least 1 image is required")

    images = []
    shapes = set()
    to_tensor = TF.ToTensor()

    for image_path in image_path_list:
        img = Image.open(image_path)
        if img.mode == 'RGBA':
            # Composite onto a white background before converting to RGB
            background = Image.new('RGBA', img.size, (255, 255, 255, 255))
            img = Image.alpha_composite(background, img)

        img = img.convert("RGB")
        # Resize proportionally so the height becomes 518
        width, height = img.size
        new_height = 518
        new_width = int(width * (new_height / height))
        img = img.resize((new_width, new_height), Image.Resampling.BICUBIC)

        if new_width % 14 != 0:
            # Pad the width symmetrically up to the next multiple of 14
            pad_length = 14 - new_width % 14
            pad_left = pad_length // 2

            print("pad total: {}, pad_left: {}".format(pad_length, pad_left))
            new_width += pad_length
            new_image = Image.new("RGB", (new_width, new_height), color=(0, 0, 0))  # pad with black
            new_image.paste(img, (pad_left, 0))
            img = new_image

        img = to_tensor(img)  # Convert to tensor in [0, 1]

        shapes.add((img.shape[1], img.shape[2]))
        images.append(img)

    images = torch.stack(images)  # stack into (N, C, H, W); assumes all images share the same size

    # Ensure correct shape when single image
    if len(image_path_list) == 1:
        # Verify shape is (1, C, H, W)
        if images.dim() == 3:
            images = images.unsqueeze(0)

    return images

The results are also good, but fy is almost 1.5*fx (when in fact fx should equal fy), so the "world_points_from_depth" output is wrong and looks squashed, while the world_points are good... This makes me wonder.

dreamer1031 avatar Mar 20 '25 03:03 dreamer1031

I may see the problem. During training, all images have a width of 518, so the model predicts the field of view based on the ratio between the height and 518. In this case, I think forcing fx=fy or fy=fx should work either way. Please let me know if it does not work for you.
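
For example, a rough sketch of forcing fx = fy before unprojecting (assuming `intrinsic` is the per-view 3x3 camera matrix as a torch tensor; whether to copy fy onto fx or the other way around is left as an option):

import torch

def force_square_pixels(intrinsic, use="fy"):
    # intrinsic: (..., 3, 3); copy one focal onto the other so fx == fy.
    fixed = intrinsic.clone()
    if use == "fy":
        fixed[..., 0, 0] = fixed[..., 1, 1]  # fx <- fy
    else:
        fixed[..., 1, 1] = fixed[..., 0, 0]  # fy <- fx
    return fixed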

jytime avatar Mar 24 '25 01:03 jytime

I ran some experiments to see how the code performs at higher resolution; here are a few conclusions:

  1. You can use the code at higher resolution if you do two passes. First pass at low resolution, 518 by whatever you want (in my case 518x294), to get the intrinsic and extrinsic matrices; these can then be plugged in for high resolution. In the second pass you process the full resolution, in my case 2x and 4x. In each case the extrinsic matrix from low resolution stays the same at higher resolution, but you need to modify fx, fy, cx, cy (for 2x the focals are multiplied by 2, and the same for cx, cy). You then use the second pass's depth_map with the matrices of the first pass, the way the repo's "Construct 3D Points from Depth Maps and Cameras" snippet does, which usually leads to more accurate 3D points than the point map branch: point_map_by_unprojection = unproject_depth_map_to_point_map(depth_map, extrinsic, intrinsic). This appeared to work as intended, essentially using the depth head on the high-resolution image and the camera head on the low-resolution image. (A sketch of the intrinsic scaling appears at the end of this comment.)

  2. The issue is that the depth head at high resolution (2072x1190) is not very accurate (which is rather obvious given the training); see the examples below (input image, VGGT depth, and ml-depth-pro metric depth). Note this is 4x the recommended 518 resolution, so it is stretching the limits of the algorithm!

[Images: input image, VGGT depth, and ml-depth-pro metric depth]

In summary, for my own use in NVS, accuracy at high resolution requires both accurate depth and accurate camera matrices. Other uses may be fine with low resolution.
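
For reference, a rough sketch of the intrinsic scaling from point 1 above (numpy arrays and the import path are assumptions to check against your version of the repo):

import numpy as np
from vggt.utils.geometry import unproject_depth_map_to_point_map  # path as quoted in the repo README

def scale_intrinsics(intrinsic, scale):
    # intrinsic: (S, 3, 3) from the low-resolution (width=518) pass.
    intrinsic_hr = intrinsic.copy()
    intrinsic_hr[:, 0, 0] *= scale  # fx
    intrinsic_hr[:, 1, 1] *= scale  # fy
    intrinsic_hr[:, 0, 2] *= scale  # cx
    intrinsic_hr[:, 1, 2] *= scale  # cy
    return intrinsic_hr

# extrinsic, intrinsic: from pass 1 at width 518; depth_map_hr: from pass 2 at 2x resolution.
# points_3d = unproject_depth_map_to_point_map(depth_map_hr, extrinsic, scale_intrinsics(intrinsic, 2.0))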

yaseryacoob avatar Mar 24 '25 12:03 yaseryacoob

Hi @yaseryacoob ,

Thanks for the detailed discussion—the example looks great!

If you’re aiming for multi-view consistency while preserving high resolution, one possible solution is to use our predicted depth maps along with other monocular estimations (e.g., ml-depth-pro) to compute a scale and shift, and then align these monocular predictions accordingly. This approach might yield better results, but it would need to be verified in 3D space.
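
A minimal sketch of such a scale-and-shift fit (a plain least-squares fit of the monocular depth to the VGGT depth; whether to fit in depth or in disparity, and how to mask invalid pixels, are choices left open here):

import numpy as np

def align_scale_shift(mono_depth, vggt_depth, mask=None):
    # Solve min_{s, t} || s * mono + t - vggt ||^2 over valid pixels.
    if mask is None:
        mask = np.isfinite(mono_depth) & np.isfinite(vggt_depth) & (vggt_depth > 0)
    x = mono_depth[mask].ravel()
    y = vggt_depth[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * mono_depth + t, (s, t)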

jytime avatar Mar 24 '25 19:03 jytime

@yaseryacoob Thanks for sharing this practice!

On the ETH3D dataset, I followed your suggestions, using the camera extrinsics and FoV output from the width=518 input to unproject the high-resolution depth prediction at width=518x2.

However, I did not obtain accurate high-resolution depth maps as expected. The monocular input view:

[Image: monocular input view]

The GT point cloud (colored), unprojected width=518 depth (yellow), and unprojected width=518x2 depth (green). I used Umeyama to align the unprojected depths with the GT.

[Image: GT point cloud (colored) with the unprojected width=518 (yellow) and width=518x2 (green) depths]

The low-resolution point cloud (yellow) is aligned with the GT, but the high-resolution point cloud (green) is not.
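
For completeness, a rough sketch of the Umeyama similarity alignment mentioned above (assuming two (N, 3) numpy arrays of corresponding points, e.g. unprojected pixels matched to GT):

import numpy as np

def umeyama_align(src, dst):
    # Similarity transform (s, R, t) minimizing || s * R @ src_i + t - dst_i ||^2.
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1  # handle reflection
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / xs.var(axis=0).sum()
    t = mu_d - scale * R @ mu_s
    return scale, R, t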

Meanwhile, for high-resolution processing, unprojecting HR depth maps using the LR camera params still works better than the 3D points prediction branch, because the 3D points prediction for HR inputs gets more distorted in my example.

Could you please share some visualizations of the x2-resolution depth predictions, perhaps the 3D visualization?

Great thanks!

ChenYutongTHU avatar Apr 07 '25 15:04 ChenYutongTHU

I ran on your image and overall confirm your observations. Here is what I did: force the image to 518 and then run at 518*2. The depth images all appear correct; see below (518 and 1036).

[Images: depth predictions at 518 and at 1036]

Using the extrinsic matrix from 518, doubling the intrinsics, and unprojecting gives point clouds that are aligned only in the middle; they don't align as well at the doubled resolution. See the rendering of the two point clouds: the window and the displays are well registered, but the left side is not (exactly like yours). I think the problem is that the intrinsic and extrinsic "trick" is not accurate enough, so one could estimate them again using other approaches that can handle the high resolution directly (like Unity Depth). There may also be some distortion in the image that is not accounted for by the matrices to begin with?

[Image: rendering of the two point clouds]

I have been working on merging multiple high-resolution images into a single point cloud, which has been quite tricky so far.

yaseryacoob avatar Apr 07 '25 17:04 yaseryacoob

Hi @ChenYutongTHU,

Are you using the undistorted or the original version of ETH3D? The original version of ETH3D was captured with a non-pinhole camera and has noticeable distortion, so you need to undistort the images first. (Actually, I would be surprised if VGGT could handle distorted images even at 518 resolution.)

jytime avatar Apr 07 '25 18:04 jytime

@ChenYutongTHU I am using the one you embedded above. I don't have the dataset.

yaseryacoob avatar Apr 07 '25 18:04 yaseryacoob

@yaseryacoob Great, thanks. @jytime Yes, I used the undistorted images, which were released as 'scenename_dslr_undistorted.7z'. The download link for the office scene I used is https://www.eth3d.net/data/office_dslr_undistorted.7z.

ChenYutongTHU avatar Apr 08 '25 11:04 ChenYutongTHU