Question about demo_colmap.py
Hi, I noticed that in `demo_colmap.py`, the code uses:

```python
pose_enc = model.camera_head(aggregated_tokens_list)[-1]
depth_map, depth_conf = model.depth_head(aggregated_tokens_list, images, ps_idx)
```

instead of simply calling

```python
predictions = model(images)
```

Could you please explain the reason for explicitly invoking `camera_head` and `depth_head` instead of using the full forward pass?
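(For context, my understanding is that the `aggregated_tokens_list` and `ps_idx` inputs above are produced by the aggregator earlier in the script, roughly like this, though the exact autocast/dtype details may differ:)

```python
import torch

# Earlier in demo_colmap.py (as I read it): the tokens and the patch-start
# index come from the aggregator, inside a no-grad / autocast context.
with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):  # dtype may differ
        aggregated_tokens_list, ps_idx = model.aggregator(images)
```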
Also, when I try to call `model.camera_head` and `model.depth_head` manually as shown, I encounter out-of-memory (OOM) issues on my GPU. Is there a recommended way to avoid this, or is it more efficient to use the unified `model(images)` interface?
Thank you very much!
@lycooool
- I think calling `camera_head` and `depth_head` separately instead of the simpler `predictions = model(images)` might be because this demo doesn't require `predict_world_points`, which is invoked at this line.
- Regarding the OOM issue, I believe there's no significant difference between the two inference approaches in terms of memory usage. Have you tried setting a smaller `frames_chunk_size` when running `depth_head`? (A minimal sketch follows this list.)
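For example, something like this (assuming `depth_head` exposes `frames_chunk_size` as a keyword argument; the value 2 is just an illustration):

```python
# Process the depth head a few frames at a time to lower peak memory.
depth_map, depth_conf = model.depth_head(
    aggregated_tokens_list, images, ps_idx, frames_chunk_size=2
)
```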
If the issue still persists, maybe you could try chunking the token sequences in the linear layers of the attention module (see here, here and here). I’ve tested this and it significantly improved the maximum frame sequence length the model could handle.
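To illustrate the kind of change I mean (the helper name and chunk size are hypothetical; the actual lines to patch are the ones linked above), the idea is to push tokens through the linear layers one slice of the sequence at a time, under no-grad inference:

```python
import torch

def forward_in_token_chunks(
    module: torch.nn.Module, x: torch.Tensor, chunk_size: int = 8192
) -> torch.Tensor:
    """Apply a per-token module (e.g. an nn.Linear) over the sequence
    dimension in chunks. x has shape [B, N, C]. Only one chunk's
    intermediate activations are alive at a time, which caps peak memory
    during no-grad inference; the output is still the full-size tensor."""
    if x.shape[1] <= chunk_size:
        return module(x)
    return torch.cat([module(c) for c in x.split(chunk_size, dim=1)], dim=1)

# Hypothetical usage inside an attention block's forward, e.g. replacing
#   qkv = self.qkv(x)
# with:
#   qkv = forward_in_token_chunks(self.qkv, x)
```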
@duybui1911 Thank you for your kind reply!
I’ve tried adjusting the `frames_chunk_size`, but unfortunately it didn’t seem to make a significant difference.
My GPU is an A5000 with 24 GB of VRAM.
In `demo_colmap.py`, the shape of `images` is `[1, 62, 3, 518, 518]`.
Given this configuration, do you think encountering an OOM is expected?
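For what it's worth, here is my rough back-of-envelope for the token count, assuming a 14-pixel patch size (my assumption):

```python
# Rough token-count estimate for images of shape [1, 62, 3, 518, 518],
# assuming 14-pixel patches (adjust if the backbone uses a different size).
patch = 14
per_side = 518 // patch       # 37 patches per side
per_frame = per_side ** 2     # 1369 patch tokens per frame
total = 62 * per_frame        # 84878 tokens seen by each global-attention layer
print(per_side, per_frame, total)
```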
Additionally, if I were to chunk the token sequences within the linear layers of the attention module, would you have any suggestions or best practices for implementing this?
Thanks again!