Question about demo_colmap.py
Hi, I noticed that in `demo_colmap.py`, the code uses:

```python
pose_enc = model.camera_head(aggregated_tokens_list)[-1]
depth_map, depth_conf = model.depth_head(aggregated_tokens_list, images, ps_idx)
```

instead of simply calling

```python
predictions = model(images)
```

Could you please explain the reason for explicitly invoking `camera_head` and `depth_head` instead of using the full forward pass?
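(For context, my understanding is that the `aggregated_tokens_list` and `ps_idx` inputs above are produced by the aggregator earlier in the script, roughly like this, though the exact autocast/dtype details may differ:)

```python
import torch

# Earlier in demo_colmap.py (as I read it): the tokens and the patch-start
# index come from the aggregator, inside a no-grad / autocast context.
with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):  # dtype may differ
        aggregated_tokens_list, ps_idx = model.aggregator(images)
```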
Also, when I try to call `model.camera_head` and `model.depth_head` manually as shown, I encounter out-of-memory (OOM) issues on my GPU. Is there a recommended way to avoid this, or is it more efficient to use the unified `model(images)` interface?
Thank you very much!
@lycooool
- I think calling `camera_head` and `depth_head` separately instead of the simpler `predictions = model(images)` might be because this demo doesn't require `predict_world_points`, which is invoked at this line.
- Regarding the OOM issue, I believe there's no significant difference between the two inference approaches in terms of memory usage. Have you tried setting a smaller `frames_chunk_size` when running `depth_head`? (A minimal sketch follows this list.)
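For example, something like this (assuming `depth_head` exposes `frames_chunk_size` as a keyword argument; the value 2 is just an illustration):

```python
# Process the depth head a few frames at a time to lower peak memory.
depth_map, depth_conf = model.depth_head(
    aggregated_tokens_list, images, ps_idx, frames_chunk_size=2
)
```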
If the issue still persists, maybe you could try chunking the token sequences in the linear layers of the attention module (see here, here and here). I’ve tested this and it significantly improved the maximum frame sequence length the model could handle.
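To illustrate the kind of change I mean (the helper name and chunk size are hypothetical; the actual lines to patch are the ones linked above), the idea is to push tokens through the linear layers one slice of the sequence at a time, under no-grad inference:

```python
import torch

def forward_in_token_chunks(
    module: torch.nn.Module, x: torch.Tensor, chunk_size: int = 8192
) -> torch.Tensor:
    """Apply a per-token module (e.g. an nn.Linear) over the sequence
    dimension in chunks. x has shape [B, N, C]. Only one chunk's
    intermediate activations are alive at a time, which caps peak memory
    during no-grad inference; the output is still the full-size tensor."""
    if x.shape[1] <= chunk_size:
        return module(x)
    return torch.cat([module(c) for c in x.split(chunk_size, dim=1)], dim=1)

# Hypothetical usage inside an attention block's forward, e.g. replacing
#   qkv = self.qkv(x)
# with:
#   qkv = forward_in_token_chunks(self.qkv, x)
```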
@duybui1911 Thank you for your kind reply!
I’ve tried adjusting the `frames_chunk_size`, but unfortunately it didn’t seem to make a significant difference.
My GPU is an A5000 with 24 GB of VRAM.
In `demo_colmap.py`, the shape of `images` is `[1, 62, 3, 518, 518]`.
Given this configuration, do you think encountering an OOM is expected?
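For what it's worth, here is my rough back-of-envelope for the token count, assuming a 14-pixel patch size (my assumption):

```python
# Rough token-count estimate for images of shape [1, 62, 3, 518, 518],
# assuming 14-pixel patches (adjust if the backbone uses a different size).
patch = 14
per_side = 518 // patch       # 37 patches per side
per_frame = per_side ** 2     # 1369 patch tokens per frame
total = 62 * per_frame        # 84878 tokens seen by each global-attention layer
print(per_side, per_frame, total)
```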
Additionally, if I were to chunk the token sequences within the linear layers of the attention module, would you have any suggestions or best practices for implementing this?
Thanks again!