Scaffold-GS

High memory usage leads to process being killed during data loading phase

[Open] plumeri opened this issue 7 months ago • 3 comments

Hi, thanks for your great work on Scaffold-Gaussian!

I'm currently running into a memory issue when loading data. During the data loading phase, the RAM usage spikes very high and causes the process to be killed by the system (OOM).

Here are some relevant details:

Dataset size: 197 images

Image resolution: automatically downscaled to 1.6k

Kill reason: Out of memory: Killed process 217423 (python) total-vm:112518872kB, anon-rss:30204444kB, file-rss:kB, shmem-rss:3072kB, UID:1000 pgtables:82836kB oom_score_adj:0

Last log message:

(egohos) bird@SEUVCL-7020:/workplace/citygs/Scaffold-GS$ bash ./single_train.sh 0
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
Loading model from: /home/bird/miniconda3/envs/egohos/lib/python3.10/site-packages/lpips/weights/v0.1/vgg.pth
found tf board
2025-05-22 20:14:57,532 - INFO: args: Namespace(sh_degree=3, feat_dim=32, n_offsets=10, voxel_size=0.005, update_depth=3, update_init_factor=32, update_hierachy_factor=4, use_feat_bank=False, source_path='data/3dgs_arg', model_path='outputs/3dgs_arg/baseline/2025-05-08_20:14:44', images='images', resolution=-1, white_background=False, data_device='cuda', eval=True, lod=30, appearance_dim=0, lowpoly=False, ds=1, ratio=1, undistorted=False, add_opacity_dist=False, add_cov_dist=False, add_color_dist=False, iterations=30000, position_lr_init=0.0, position_lr_final=0.0, position_lr_delay_mult=0.01, position_lr_max_steps=30000, offset_lr_init=0.01, offset_lr_final=0.0001, offset_lr_delay_mult=0.01, offset_lr_max_steps=30000, feature_lr=0.0075, opacity_lr=0.02, scaling_lr=0.007, rotation_lr=0.002, mlp_opacity_lr_init=0.002, mlp_opacity_lr_final=2e-05, mlp_opacity_lr_delay_mult=0.01, mlp_opacity_lr_max_steps=30000, mlp_cov_lr_init=0.004, mlp_cov_lr_final=0.004, mlp_cov_lr_delay_mult=0.01, mlp_cov_lr_max_steps=30000, mlp_color_lr_init=0.008, mlp_color_lr_final=5e-05, mlp_color_lr_delay_mult=0.01, mlp_color_lr_max_steps=30000, mlp_featurebank_lr_init=0.01, mlp_featurebank_lr_final=1e-05, mlp_featurebank_lr_delay_mult=0.01, mlp_featurebank_lr_max_steps=30000, appearance_lr_init=0.05, appearance_lr_final=0.0005, appearance_lr_delay_mult=0.01, appearance_lr_max_steps=30000, percent_dense=0.01, lambda_dssim=0.2, start_stat=500, update_from=1500, update_interval=100, update_until=15000, min_opacity=0.005, success_threshold=0.8, densify_grad_threshold=0.0002, convert_SHs_python=False, compute_cov3D_python=False, debug=False, ip='127.0.0.1', port=25344, debug_from=-1, detect_anomaly=False, warmup=False, use_wandb=False, test_iterations=[30000], save_iterations=[30000, 30000], quiet=False, checkpoint_iterations=[], start_checkpoint=None, gpu='-1')
Backup Finished!
2025-05-22 20:14:58,203 - INFO: Optimizing outputs/3dgs_arg/baseline/2025-05-22_20:14:44
Output folder: outputs/3dgs_arg/baseline/2025-05-22_20:14:44 [22/05 20:14:58]
Reading camera 196/196 [22/05 20:14:58]
using lod, using eval [22/05 20:14:58]
test_cam_infos: 31 [22/05 20:14:58]
start fetching data from ply file [22/05 20:14:58]
Loading Training Cameras [22/05 20:14:58]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K. If this is not desired, please explicitly specify '--resolution/-r' as 1 [22/05 20:14:58]
(egohos) bird@SEUVCL-7020:~/workplace/citygs/Scaffold-GS$ Loading Test Cameras [22/05 20:16:14]
./train.sh: line 36: 33040 Killed python train.py --eval -s data/${data} --lod ${lod} --gpu ${gpu} --voxel_size ${vsize} --update_init_factor ${update_init_factor} --appearance_dim ${appearance_dim} --ratio ${ratio} --iterations ${iterations} --port $port -m outputs/${data}/${logdir}/$time

Is this expected behavior? Is there a way to reduce memory usage during the initial camera/data loading?

Thanks for any suggestions or improvements!

plumeri commented on May 23 '25, 09:05

Could you try setting the device to "cpu" in arguments/__init__.py? Otherwise the data will be loaded onto the default GPU.
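Roughly what that setting controls, shown as a minimal stand-alone sketch (this is a toy, not the actual Scaffold-GS loading code; the function name, the float32 format, and the image height are assumptions, and the Namespace in the log above only confirms that a data_device option exists and defaults to 'cuda'):

```python
# Toy illustration only: mimics eager camera loading, where one float32 RGB
# tensor per camera stays resident for the whole run on the device named by
# data_device. Note: running this allocates a few GiB.
import torch

def load_camera_images(n_cameras, height, width, data_device="cpu"):
    cached = []
    for _ in range(n_cameras):
        img = torch.rand(3, height, width)   # stand-in for a decoded photo
        cached.append(img.to(data_device))   # "cuda" -> VRAM, "cpu" -> host RAM
    return cached

# 197 cameras at ~1.6K width (the height here is a guess at the aspect ratio):
imgs = load_camera_images(197, 1066, 1600, data_device="cpu")
gib = sum(t.numel() * t.element_size() for t in imgs) / 2**30
print(f"~{gib:.1f} GiB of image tensors held resident")
```

So with "cpu" the image tensors land in system RAM instead of VRAM; that avoids GPU-side OOM during loading, but by itself it does not shrink the overall footprint.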

inspirelt commented on May 23 '25, 12:05

> Could you try setting the device to "cpu" in arguments/__init__.py? Otherwise the data will be loaded onto the default GPU.

Thanks for the reply!

I tried setting the device to "cpu" in arguments/__init__.py, but unfortunately the issue persists: the training process still ends in an out-of-memory (OOM) kill, likely on the CPU/RAM side.

Here are the memory stats at the time of the crash:

RAM: 38.5 GB / 63.7 GB used
VRAM (GPU): 0.5 GB / 55.9 GB used

Additionally, I tried reducing the image resolution to 1024, but the memory usage remained almost the same.

Could the issue be related to how image or camera data is being loaded or cached into memory?
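One way to check that (a generic probe, not something already in the repo; psutil and the log_rss helper are assumptions added for illustration) is to log the process's resident set size around the camera-loading step and see whether it grows with each camera:

```python
# Generic RSS probe (assumes psutil is installed: pip install psutil).
# Calling log_rss(...) before and after the "Loading Training Cameras" /
# "Loading Test Cameras" steps would show whether RAM grows as cameras load.
import os
import psutil

_proc = psutil.Process(os.getpid())

def log_rss(tag: str) -> None:
    rss_gib = _proc.memory_info().rss / 2**30
    print(f"[mem] {tag}: RSS = {rss_gib:.2f} GiB", flush=True)

log_rss("example call")
```

If RSS grows by roughly the decoded size of the images, the per-camera cache is what fills RAM; if it climbs far beyond that, some other part of the loading path is also holding copies.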

For context, I'm running this on WSL2 (Ubuntu 22.04), in case that affects how memory is managed.

plumeri commented on May 23 '25, 15:05

I think the problem is on the CPU side. You could check the CPU utilization while the data loads.
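For example, a generic psutil sketch (not project code) that can be left running in a second terminal to watch CPU and RAM together while the data loads:

```python
# Stand-alone monitor: prints system-wide CPU and RAM usage roughly once per
# second. Stop with Ctrl+C. Requires: pip install psutil
import psutil

while True:
    cpu = psutil.cpu_percent(interval=1.0)   # average CPU % over the last second
    mem = psutil.virtual_memory()            # system-wide memory stats
    print(f"CPU {cpu:5.1f}% | RAM {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB",
          flush=True)
```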

cskrren commented on Jun 27 '25, 08:06