Is it possible to process 1000+ images on an 80G GPU?
Hi, thanks for your time! Is there a way to reduce VGGT's memory usage, like using a low-resolution checkpoint or similar?
Hi @y6216886 ,
We did not implement this ourselves, but you can
(1) Run VGGT over different image subsets and then align all of them, or (2) distribute the inference over multiple GPUs, as done in https://github.com/facebookresearch/fast3r
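For (1), a minimal sketch of the chunking half, following the usage from our README (`VGGT.from_pretrained` and `load_and_preprocess_images`); the helper name and the chunk/overlap sizes are only illustrative:

```python
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

def run_in_chunks(image_names, chunk_size=100, overlap=10):
    """Run VGGT on overlapping subsets; one prediction dict per subset."""
    predictions = []
    step = chunk_size - overlap
    for start in range(0, len(image_names), step):
        subset = image_names[start:start + chunk_size]
        images = load_and_preprocess_images(subset).to(device)
        with torch.no_grad():
            # Each prediction lives in its own normalized coordinate
            # frame, so the subsets still need to be aligned afterwards.
            predictions.append(model(images))
        if start + chunk_size >= len(image_names):
            break
    return predictions
```

Bringing the per-subset outputs into a common frame is then the harder part.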
@jytime thanks again for your work.
I'm currently trying option 1. I have tested some approaches, but never with good results:
- Overlap of 1: between two subsets there is one overlapping image. Since the first image of each subset is the reference, I apply the transform of the last image of the first subset to the second subset (roughly the chaining sketched after this list). It seems OK, but there are considerable errors (with an overlap of 10, I use the 1st image to compute the transform and the remaining 9 to compute the error).
- 50% overlap: use 50% overlap between subsets, and try to align by minimizing the mean squared error over the poses of the overlapping cameras. This seems to work worse; there is no guarantee that the relative transforms between the same images in consecutive batches are similar. I tried a least-squares fit of a rigid transform (rotation and translation) and had some "ok" results (the algorithm works, but the error is still high). I also tried to estimate a scale factor, but with that it stopped converging.
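For reference, the overlap-of-1 chaining I am doing is roughly the following (a sketch; I assume 4x4 camera-to-world extrinsics, and the function name is mine):

```python
import numpy as np

def chain_subsets(poses_a, poses_b):
    """Express subset B's poses in subset A's frame via one shared image.

    poses_a, poses_b: lists of 4x4 camera-to-world matrices, each in its
    subset's own normalized frame. The last pose of A and the first pose
    of B are assumed to belong to the same physical image.
    """
    # Rigid transform taking B-frame coordinates into A's frame. When B's
    # first pose is the identity (first image as reference), this reduces
    # to poses_a[-1].
    t_b_to_a = poses_a[-1] @ np.linalg.inv(poses_b[0])
    # Note: this ignores any scale difference between the two subsets,
    # which may be where my error comes from.
    return [t_b_to_a @ p for p in poses_b]
```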
In your opinion, what would be the best approach? Have you ever tested something similar?
Moreover, I couldn't find this in your paper: is it a valid assumption that 10 images in different subsets will have the same relative transforms, assuming the same order?
I hope you can answer some of my questions. Thanks again!
Hi @bidbest ,
The predictions from VGGT live in normalized coordinate spaces, but each subset resides in its own separate normalized space. Therefore, while you can assume these spaces are similar, it's not accurate to assume that "10 images in different subsets will have the same relative transforms". In more detail, this is valid for rotation, but there is a scale and shift difference in the translation vectors.
For alignment purposes, your option 2 (overlapping subsets by about 50%) is indeed a sensible approach. However, to achieve accurate alignment, you should explicitly estimate the relative transform between subsets by registering the point cloud of subset A to the point cloud of subset B, i.e., point cloud registration, which can be done with closed-form solutions such as Umeyama's method.
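As a reference, Umeyama's closed-form solution is only a few lines of numpy. A minimal sketch (apply it to corresponding 3D points, e.g. the predicted points of the overlapping images in subsets A and B):

```python
import numpy as np

def umeyama(src, dst):
    """Closed-form similarity transform with dst ≈ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Returns scale s, rotation R (3x3), translation t (3,).
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    x, y = src - mu_src, dst - mu_dst
    cov = y.T @ x / len(src)                      # 3x3 cross-covariance
    u, d, vt = np.linalg.svd(cov)
    s_fix = np.eye(3)
    if np.linalg.det(u) * np.linalg.det(vt) < 0:  # guard against reflection
        s_fix[2, 2] = -1.0
    r = u @ s_fix @ vt
    var_src = (x ** 2).sum() / len(src)
    s = (d * np.diag(s_fix)).sum() / var_src
    t = mu_dst - s * r @ mu_src
    return s, r, t
```

With (s, R, t) you can then map every point and camera center of subset A into subset B's frame via `s * r @ p + t`.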
You might explore built-in functions or tools such as those provided by Dust3r or similar methods for efficient registration. However, I haven't personally verified these particular implementations yet.
FYI:
https://github.com/facebookresearch/vggt/issues/37
@jytime thank you very much for your reply.
From the thread you suggested, it seems it should be possible to use the camera poses themselves to determine the scale normalisation?
If I understand correctly, all subsets will have their own normalized translations. Is this normalisation a single parameter, or one per dimension (X, Y, Z)?
With enough overlapping images (10-20), it should be possible to infer such normalisation parameter(s) from the poses themselves, right?
Ideally I would prefer to find the normalisation from the camera poses themselves, rather than from the point clouds, so that the approach doesn't depend on the scene itself.
If this is possible, I really believe this approach can fully replace classic SfM! Thanks again for your work and your replies!
Hi, yes, I think it is also possible to use the camera poses to find the scale, although I have not tried it myself. If we can assume different subsets share the same scene center, it is okay to use only a single parameter. But this assumption is usually too harsh, so it is probably best to start from 3 parameters. Yes, it should be enough to infer them from 10-20 overlapping images, though the accuracy may need to be double-checked.
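To make this concrete, here is an untested sketch of one reading of the idea: fit a scale per axis (plus a shift per axis, since the scene centers generally differ) over the overlapping camera centers, assuming the rotations of the two normalized frames already agree:

```python
import numpy as np

def fit_scale_shift_per_axis(centers_a, centers_b):
    """Per-axis scale and shift such that centers_b ≈ s * centers_a + t.

    centers_a, centers_b: (N, 3) camera centers of the same physical
    cameras, taken from the overlapping images of two subsets. Assumes
    the rotations of the two normalized frames already agree; otherwise
    use a full similarity fit (e.g. Umeyama above) on the centers.
    """
    s, t = np.empty(3), np.empty(3)
    for k in range(3):
        # 1D least-squares line fit per axis: b ≈ s * a + t
        s[k], t[k] = np.polyfit(centers_a[:, k], centers_b[:, k], 1)
    return s, t
```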
Thanks for your kind words :)
@bidbest Hi, I am trying to do a similar thing. In my experience, the #37 approach works for camera alignment between different batches. But how are you aligning the point clouds? Could you please share more details about your approach or code?