
[QUESTION] Inference with an image size of 256

yukiumi13 opened this issue 1 year ago · 6 comments

Thank you for this great work!

I am trying to integrate mast3r into some large 3D networks. Due to memory limitations, I can only input images of up to 256x256. Therefore, I would like to know whether it is correct to directly change 'true_shape' in the input dict to (256, 256) and feed images of shape (b, 3, 256, 256) to the published 512x512 pretrained weights. I noticed that doing so did not raise any explicit errors, but the number of matches found decreased dramatically.
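For reference, a minimal sketch of what such a 256x256 input could look like, assuming DUSt3R/MASt3R-style view dicts with 'img' and 'true_shape' entries; the extra keys and the model call are assumptions, not taken from the released code:

```python
import torch

def make_view(img_batch, idx):
    # img_batch: (B, 3, 256, 256) float tensor, normalized the same way as the training data
    b, _, h, w = img_batch.shape
    return {
        'img': img_batch,
        # one (H, W) row per image in the batch
        'true_shape': torch.tensor([h, w], dtype=torch.int32).repeat(b, 1),
        'idx': idx,            # assumed key, mirroring dust3r-style loaders
        'instance': str(idx),  # assumed key
    }

view1 = make_view(torch.randn(2, 3, 256, 256), 0)
view2 = make_view(torch.randn(2, 3, 256, 256), 1)
# pred1, pred2 = model(view1, view2)   # pts3d expected back as (B, 256, 256, 3)
```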

yukiumi13 avatar Aug 29 '24 20:08 yukiumi13

Hi, when you say "I noticed that doing so did not result in any explicit errors", do you mean the predicted point map?

ljjTYJR avatar Aug 30 '24 09:08 ljjTYJR

Hi. Yes, mast3r can output pts3d with shape b x 256 x 256 x 3, since it adjusts to the patch grid automatically. I found that the low performance was likely caused by the simple NN search (without iteration) I used in my own code. I'm refactoring the fast_nn_reciprocal method in mast3r so that it supports batched inputs and is fully torch-based, enabling backpropagation.
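For what it's worth, here is a minimal sketch of batched reciprocal NN matching in pure PyTorch (not the actual fast matching code from the repo). Note that the hard argmax itself is not differentiable, so gradients would only flow through whatever downstream loss consumes the matched descriptors:

```python
import torch

def reciprocal_nn(desc1, desc2):
    # desc1: (B, N, D), desc2: (B, M, D) per-pixel descriptors, flattened over the image
    d1 = torch.nn.functional.normalize(desc1, dim=-1)
    d2 = torch.nn.functional.normalize(desc2, dim=-1)
    sim = torch.einsum('bnd,bmd->bnm', d1, d2)        # (B, N, M) cosine similarity

    nn12 = sim.argmax(dim=2)                          # best match in 2 for each point of 1: (B, N)
    nn21 = sim.argmax(dim=1)                          # best match in 1 for each point of 2: (B, M)

    # a pair (i, nn12[i]) is kept only if it maps back to i
    idx = torch.arange(desc1.shape[1], device=desc1.device).unsqueeze(0)   # (1, N)
    reciprocal = torch.gather(nn21, 1, nn12) == idx   # (B, N) boolean mask

    return nn12, reciprocal
```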

yukiumi13 avatar Aug 30 '24 10:08 yukiumi13

After implementing the fast_nn method described in MASt3R, we observed a significant increase in correspondences. Additionally, based on some papers I have read, it is recommended to use inputs of shape (B, 3, 224, 224) during training to avoid disturbing the PE of the ViT. [screenshot 2024-09-02 21:10:46]
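A rough sketch of the iterated NN "ping-pong" as I understand the fast matching scheme (a simplified, unbatched illustration, not the repository code): start from a subsampled set of pixels in image 1, hop to the nearest descriptor in image 2 and back, and keep the pairs that become fixed points.

```python
import torch

def iterative_reciprocal_nn(desc1, desc2, init_idx, n_iters=5):
    # desc1: (N, D), desc2: (M, D); init_idx: (K,) long indices into desc1 (e.g. a sparse grid)
    d1 = torch.nn.functional.normalize(desc1, dim=-1)
    d2 = torch.nn.functional.normalize(desc2, dim=-1)

    cur1 = init_idx.clone()
    matches = []
    for _ in range(n_iters):
        cur2 = (d1[cur1] @ d2.T).argmax(dim=1)    # hop 1 -> 2
        back1 = (d2[cur2] @ d1.T).argmax(dim=1)   # hop 2 -> 1

        converged = back1 == cur1                 # fixed points = reciprocal matches
        matches.append(torch.stack([cur1[converged], cur2[converged]], dim=1))

        cur1 = back1[~converged]                  # keep iterating the unconverged points only
        if cur1.numel() == 0:
            break
    return torch.cat(matches, dim=0) if matches else torch.empty(0, 2, dtype=torch.long)
```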

yukiumi13 avatar Sep 02 '24 13:09 yukiumi13

Which paper mentions the disturbance of the PE? Will it also affect RoPE?

ljjTYJR avatar Sep 02 '24 14:09 ljjTYJR

Hello. As far as I know, no paper on 3D vision has discussed this issue specifically, but some ViT-based video generation papers have examined it. For example, the position-code extrapolation experiments in CogVideo show degraded generation quality when the inference resolution is changed directly.
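Not MASt3R-specific, but for ViTs with learned absolute position embeddings the usual mitigation when changing resolution is to bicubically interpolate the embedding grid (DINO/timm style); RoPE has no learned table, so this does not apply to it directly. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_hw, new_hw):
    # pos_embed: (1, old_h*old_w, D) learned patch position embeddings (no cls token)
    oh, ow = old_hw
    nh, nw = new_hw
    d = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, oh, ow, d).permute(0, 3, 1, 2)            # (1, D, oh, ow)
    grid = F.interpolate(grid, size=(nh, nw), mode='bicubic', align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, nh * nw, d)
```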

yukiumi13 avatar Sep 02 '24 15:09 yukiumi13

> After implementing the fast_nn method described in MASt3R, we observed a significant increase in correspondences. Additionally, based on some papers I have read, it is recommended to use inputs of shape (B, 3, 224, 224) during training to avoid disturbing the PE of the ViT.

I also observed a similar performance drop at 256x256 resolution, and converting images to 224x224 makes dust3r happy. @yukiumi13 Do you think the strategy in https://github.com/naver/dust3r/issues/62 might help?
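A simple preprocessing sketch for feeding 224x224 inputs without distorting the aspect ratio (torchvision-based; the normalization constants are an assumption, not taken from the dust3r loader):

```python
from torchvision import transforms

to_224 = transforms.Compose([
    transforms.Resize(224),        # scale the short side to 224, keeping aspect ratio
    transforms.CenterCrop(224),    # square 224x224 crop
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),  # assumed normalization
])
# img_tensor = to_224(pil_image).unsqueeze(0)   # (1, 3, 224, 224)
```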

rwn17 avatar Sep 03 '24 06:09 rwn17