training process
Thank you for the great work you have done on VGGT. I have some questions regarding the training process:
- Did all of the dozen or so datasets you used for training include depth maps? If not, how were the datasets without depth maps used for training?
- When training on the CO3D dataset, there is a parameter `point_masks` that requires depth maps to generate, and it is used to filter the valid points for the camera loss. Are depth maps necessary to compute the camera loss and train camera-pose prediction? I would greatly appreciate it if you could take the time to answer these questions!
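To illustrate the question, here is a minimal sketch of how a depth-derived validity mask can gate a per-point loss. All names here (`point_mask_from_depth`, the depth range) are hypothetical and not taken from the VGGT code:

```python
import numpy as np

def point_mask_from_depth(depth, near=0.01, far=100.0):
    """Hypothetical sketch: boolean mask of pixels with valid depth.
    Pixels with non-finite, zero/negative, or out-of-range depth are
    excluded, so a loss only averages over valid points."""
    return np.isfinite(depth) & (depth > near) & (depth < far)

# Toy depth map: one zero pixel and one non-finite pixel are invalid.
depth = np.array([[0.0, 2.5],
                  [np.inf, 50.0]])
mask = point_mask_from_depth(depth)

# Masked loss: average a stand-in per-pixel error over valid pixels only.
err = np.abs(depth - 2.0)   # placeholder per-pixel error, not the real camera loss
loss = err[mask].mean()
```

The point of the mask is that invalid pixels (missing or clipped depth) contribute nothing to the gradient, rather than injecting arbitrary error values.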
Hi, I am not the author so please take this with a grain of salt.
- My understanding is that all datasets need to include depth maps. For datasets that did not ship with them (e.g. DL3DV-10K), the authors generated their own depth maps by running some pipeline (e.g. an off-the-shelf depth estimator or a reconstruction pipeline).
Thank you for your answer. I noticed that the author previously replied to you, stating, 'The normalized depth maps will vary across datasets.' I have recently been using some existing depth estimation methods to estimate depth maps for training.
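Since pseudo-depth from different estimators comes in different (often relative) scales, one common way to make such maps comparable across datasets is a per-map scale normalization. A minimal sketch under that assumption (the function name and median-based scale are my own choices, not the authors'):

```python
import numpy as np

def normalize_depth(depth, mask=None):
    """Hypothetical sketch: divide a depth map by its median valid depth,
    a simple scale-invariant normalization so that pseudo-depth produced
    by different estimators/datasets lives on a comparable scale."""
    if mask is None:
        mask = np.isfinite(depth) & (depth > 0)
    scale = np.median(depth[mask])
    return depth / scale, scale

# Toy example: the zero pixel is treated as invalid and ignored for the scale.
depth = np.array([[1.0, 2.0],
                  [4.0, 0.0]])
norm, scale = normalize_depth(depth)
```

With median-based scaling, a few outlier pixels (e.g. sky or reflections with huge predicted depth) do not dominate the normalization the way a mean would.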