
CUDA out of memory on demo_colmap.py

Open XLR-man opened this issue 5 months ago • 11 comments

I want to run demo_colmap.py to get files for gsplat.

My dataset has 40 images, but I get CUDA out of memory, even though your results say it can handle 100 images.

My GPU is a 4090 with 24 GB, and I set torch.float16, but the problem remains:

Arguments: {'scene_dir': '../data/vggt/images', 'seed': 42, 'use_ba': True, 'max_reproj_error': 8.0, 'shared_camera': False, 'camera_type': 'SIMPLE_PINHOLE', 'vis_thresh': 0.2, 'query_frame_num': 8, 'max_query_pts': 4096, 'fine_tracking': True, 'conf_thres_value': 5.0}
Setting seed as: 42
Using device: cuda
Using dtype: torch.float16
Model loaded
Loaded 40 images from ../data/vggt/images
Using cache found in /data/xielangren/.cache/torch/hub/facebookresearch_dinov2_main
/data/xielangren/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/swiglu_ffn.py:51: UserWarning: xFormers is not available (SwiGLU)
  warnings.warn("xFormers is not available (SwiGLU)")
/data/xielangren/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/attention.py:33: UserWarning: xFormers is not available (Attention)
  warnings.warn("xFormers is not available (Attention)")
/data/xielangren/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/layers/block.py:40: UserWarning: xFormers is not available (Block)
  warnings.warn("xFormers is not available (Block)")
For faster inference, consider disabling fine_tracking
Predicting tracks for query frame 0
Traceback (most recent call last):
  File "/data/xielangren/project/vggt/demo_colmap.py", line 299, in <module>
    demo_fn(args)
  File "/data/xielangren/project/vggt/demo_colmap.py", line 156, in demo_fn
    pred_tracks, pred_vis_scores, pred_confs, points_3d, points_rgb = predict_tracks(
  File "/data/xielangren/project/vggt/vggt/dependency/track_predict.py", line 84, in predict_tracks
    pred_track, pred_vis, pred_conf, pred_point_3d, pred_color = _forward_on_query(
  File "/data/xielangren/project/vggt/vggt/dependency/track_predict.py", line 220, in _forward_on_query
    pred_track, pred_vis, _ = predict_tracks_in_chunks(
  File "/data/xielangren/project/vggt/vggt/dependency/vggsfm_utils.py", line 289, in predict_tracks_in_chunks
    fine_pred_track, _, pred_vis, pred_score = track_predictor(
  File "/data/xielangren/miniconda3/envs/llgs_tcnn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/xielangren/miniconda3/envs/llgs_tcnn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/xielangren/project/vggt/vggt/dependency/vggsfm_tracker.py", line 94, in forward
    fine_pred_track, pred_score = refine_track(
  File "/data/xielangren/project/vggt/vggt/dependency/track_modules/track_refine.py", line 137, in refine_track
    fine_pred_track_lists, _, _, query_point_feat = fine_tracker(
  File "/data/xielangren/miniconda3/envs/llgs_tcnn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/xielangren/miniconda3/envs/llgs_tcnn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/xielangren/project/vggt/vggt/dependency/track_modules/base_track_predictor.py", line 104, in forward
    fcorr_fn = CorrBlock(fmaps, num_levels=self.corr_levels, radius=self.corr_radius)
  File "/data/xielangren/project/vggt/vggt/dependency/track_modules/blocks.py", line 276, in __init__
    fmaps_ = fmaps.reshape(B * S, C, H, W)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.01 GiB. GPU 

How can I solve it? I am looking forward to your reply, thanks!!

XLR-man avatar Jul 06 '25 14:07 XLR-man

Hi, I have heard of this issue before, but I cannot help debug it as I don't have access to a 4090.

jytime avatar Jul 06 '25 21:07 jytime

Hi @XLR-man

Thanks @murlock1000 for sharing the solution. Does this solve your problem?

https://github.com/facebookresearch/vggt/pull/253

jytime avatar Jul 10 '25 16:07 jytime

Hi, the solution from @murlock1000 doesn't help in my case, which is the same: running demo_colmap.py on a 24 GB GPU. Did you solve this problem? @XLR-man

chang-xinhai avatar Jul 11 '25 13:07 chang-xinhai

@jytime @Yaenday Hi, I still cannot solve it on the 4090... I tried running on an A100 GPU; the A100 runs it successfully and only needs about 10 GB for 40 images. Why can't a 24 GB GPU run it?

XLR-man avatar Jul 11 '25 13:07 XLR-man

I guess this is a problem specific to the 4090.

jytime avatar Jul 13 '25 16:07 jytime

@jytime Using a 3090, we cannot perform inference on more than 50 images even after applying that fix; it still gives OOM.

engrmusawarali avatar Jul 14 '25 09:07 engrmusawarali

> @jytime @Yaenday Hi, I still cannot solve it on the 4090... I tried running on an A100 GPU; the A100 runs it successfully and only needs about 10 GB for 40 images. Why can't a 24 GB GPU run it?

The peak memory usage is usually much higher than the initial memory usage. Try logging memory usage step by step.
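
A minimal sketch of step-by-step memory logging, assuming you sprinkle calls between the stages in demo_colmap.py (`log_stage` and the commented placement are placeholders, not part of the repo):

```python
import torch

def log_stage(name):
    """Print current and peak GPU memory since the last reset."""
    torch.cuda.synchronize()
    cur_gib = torch.cuda.memory_allocated() / 1024**3
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{name}] current: {cur_gib:.2f} GiB, peak: {peak_gib:.2f} GiB")
    torch.cuda.reset_peak_memory_stats()

# Hypothetical placement inside demo_colmap.py:
# predictions = model(images)      # VGGT forward pass
# log_stage("vggt forward")
# ... = predict_tracks(...)        # track prediction (where the OOM occurs above)
# log_stage("predict_tracks")
```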

Six-Bit-TX avatar Jul 24 '25 06:07 Six-Bit-TX

Try lowering max_query_pts, e.g. `python demo_colmap.py --scene_dir=/YOUR/SCENE_DIR/ --use_ba --max_query_pts=2048 --query_frame_num=5` (from the README). Or you can try running without bundle adjustment.

fredsukkar avatar Aug 05 '25 11:08 fredsukkar

Same error here; I cannot figure it out based on the above solutions.

luoshuiyue avatar Aug 14 '25 08:08 luoshuiyue

Hi! I had the same error on the 4090 and resolved it by manually casting everything (model and inputs) to float16 instead of using amp, but I also needed to apply @murlock1000's fix, `x + pose_embed.to(x.dtype)`, because all inputs are in half precision while pose_embed is generated at inference time as a float32 tensor.
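
A rough sketch of that approach, with `model` and `images` standing in for the objects in demo_colmap.py (not a definitive patch):

```python
import torch

dtype = torch.float16

# Cast the model weights and the input batch to half precision explicitly,
# instead of relying on torch.autocast.
model = model.to(device="cuda", dtype=dtype)
images = images.to(device="cuda", dtype=dtype)

with torch.no_grad():
    predictions = model(images)

# Inside the tracker code, where the positional embedding is added, the fix
# mentioned above casts it to the input dtype so the addition does not
# promote activations back to float32:
#     x = x + pose_embed.to(x.dtype)
```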

It also seems that flash attention failed to run in my case. I ran inference under the torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.FLASH_ATTENTION) context (restricting every backend except flash attention) and got an error that q, k, and v are in float32 (despite autocast being enabled), so flash attention cannot be applied there. I tried to fix it by manually casting the tensors to bfloat16 inside the attention layer and succeeded in forcing torch to use the flash attention backend, but still got OOM with bfloat16 for some reason.
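
For reference, a minimal sketch of restricting SDPA to the flash-attention backend (requires torch >= 2.3; `model` and `images` are placeholders, and the model weights must already be in float16/bfloat16 for flash attention to accept q, k, v):

```python
import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

# Only the flash-attention backend is allowed inside this context; any
# scaled_dot_product_attention call with float32 inputs will then fail
# loudly instead of silently falling back to another backend.
with torch.no_grad(), sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    predictions = model(images.to(torch.bfloat16))
```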

kst179 avatar Aug 22 '25 21:08 kst179

Yes, I ran into the same error. I tried reducing the chunk size based on the GPU and it worked, but there is a slight tradeoff in accuracy because you send fewer images through at a time. This happened to me on an RTX 3090 with 24 GB VRAM; see the sketch below.
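
A generic sketch of the chunking idea (the function and parameter names are illustrative, not the actual signature in vggsfm_utils.py): run inference on smaller batches of frames and concatenate the results, trading throughput, and possibly some accuracy as noted above, for lower peak memory.

```python
import torch

def run_in_chunks(fn, frames, chunk_size=8):
    """Apply fn to frames in smaller batches to cap peak GPU memory."""
    outputs = []
    for start in range(0, frames.shape[0], chunk_size):
        with torch.no_grad():
            outputs.append(fn(frames[start:start + chunk_size]))
        torch.cuda.empty_cache()  # release cached blocks between chunks
    return torch.cat(outputs, dim=0)
```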

shaurya2524 avatar Sep 22 '25 14:09 shaurya2524