
CUDA out of memory

Open kukumallou opened this issue 1 year ago • 5 comments

First of all, thanks for the contribution. Very nice project. I ran into a CUDA out-of-memory error when running the dense reconstruction script (run_neuralangelo-colmap_dense.sh):

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.05 GiB (GPU 0; 23.64 GiB total capacity; 19.51 GiB already allocated; 911.19 MiB free; 20.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have tried reducing the number of samples per ray from 1024 to 512 and then 256, as suggested in the FAQ, but the error message stays the same. By the way, I was able to run the sparse reconstruction script successfully and got correct results. Any idea how to fix this problem? Thanks a lot.

kukumallou · Nov 02 '23
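For anyone landing here with the same error: the two knobs mentioned above (the FAQ's samples-per-ray setting and the allocator hint printed in the error message itself) can be combined on the launch command. This is only a minimal sketch, assuming the same override style as dataset.root_dir in the script; the max_split_size_mb value is just a starting point, not a recommended setting:

```bash
# Allocator hint suggested by the OOM message itself; 128 MB is an arbitrary starting value.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Samples-per-ray override from the FAQ, appended to the launch line used by
# run_neuralangelo-colmap_dense.sh at the time of this comment.
python launch.py --config configs/neuralangelo-colmap_dense-SH.yaml --gpu 0 --train \
    dataset.root_dir=$INPUT_DIR \
    model.num_samples_per_ray=256
```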

Would you mind testing the latest version and replacing python launch.py --config configs/neuralangelo-colmap_dense-SH.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR with python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR in the run_neuralangelo-colmap_dense.sh script?
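That is, the change in run_neuralangelo-colmap_dense.sh is a one-line swap of the config file:

```bash
# Before (run_neuralangelo-colmap_dense.sh):
# python launch.py --config configs/neuralangelo-colmap_dense-SH.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR

# After:
python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR
```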

hugoycj · Nov 02 '23

I tried the latest version (Tue Nov 2), but the error still exists. The card I have is a 4090 with 24 GB of memory.

kukumallou · Nov 02 '23

Sorry to bother you. Would you mind sharing at which step the out-of-memory error happens, and what the resolution of your images is?

hugoycj · Nov 02 '23
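A generic way to answer the "which step" question is to watch GPU memory from a second terminal while the script runs. This is plain nvidia-smi, nothing specific to this project:

```bash
# Poll used/total GPU memory once per second while run_neuralangelo-colmap_dense.sh runs elsewhere.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```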

There are 140 images with a resolution of 1920x1440. Below is the output log of the script:

---sfm---
Sparse map datasets/cake exist. Aborting
---model_converter---
---colmap2mvsnet---
Image pair datasets/cake/dense/pair.txt exist. Aborting
Number of model parameters: 1162696
load third_party/Vis-MVSNet/pretrained_model/vis/-1
(1, 1, 528, 960): 100%|█████| 140/140 [02:39<00:00, 1.14s/it]
---mvsnet_fusion---
load data: 100%|███| 140/140 [00:01<00:00, 137.01it/s]
prob filter: 100%|███| 140/140 [00:00<00:00, 203.46it/s]
vis filter and med fusion: 100%|████| 140/140 [00:05<00:00, 27.54it/s]
vis filter and ave fusion: 100%|████| 140/140 [00:04<00:00, 31.20it/s]
vis filter: 100%|███| 140/140 [00:04<00:00, 30.62it/s]
back proj: 100%|████| 140/140 [00:00<00:00, 293.64it/s]
Construct combined PCD
Estimate normal
---angelo_recon---
Global seed set to 42
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

Loading dense prior from datasets/cake/dense/fused.ply
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type      | Params
------------------------------------
0 | model | NeuSModel | 28.0 M
------------------------------------
28.0 M    Trainable params
0         Non-trainable params
28.0 M    Total params
55.914    Total estimated model params size (MB)

Epoch 0: : 0it [00:00, ?it/s]Update finite_difference_eps to 0.06801176275750971
Traceback (most recent call last):
  File "launch.py", line 125, in <module>
    main()
  File "launch.py", line 114, in main
    trainer.fit(system, datamodule=dm)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 194, in advance
    response = self.trainer._call_lightning_module_hook("on_train_batch_start", batch, batch_idx)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/systems/base.py", line 57, in on_train_batch_start
    update_module_step(self.model, self.current_epoch, self.global_step)
  File "/home/****/Dev/instant-angelo/systems/utils.py", line 351, in update_module_step
    m.update_step(epoch, global_step)
  File "/home/****/Dev/instant-angelo/models/neus.py", line 111, in update_step
    self.occupancy_grid_bg.every_n_step(step=global_step, occ_eval_fn=occ_eval_fn_bg, occ_thre=self.config.get('grid_prune_occ_thre_bg', 0.01))
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/nerfacc/grid.py", line 271, in every_n_step
    self._update(
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/nerfacc/grid.py", line 229, in _update
    occ = occ_eval_fn(x).squeeze(-1)
  File "/home/****/Dev/instant-angelo/models/neus.py", line 104, in occ_eval_fn_bg
    density, _ = self.geometry_bg(x)
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/geometry.py", line 125, in forward
    out = self.encoding_with_network(points.view(-1, self.n_input_dims)).view(*points.shape[:-1], self.n_output_dims).float()
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/network_utils.py", line 193, in forward
    return self.network(self.encoding(x))
  File "/home/****/anaconda3/envs/objmodel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/****/Dev/instant-angelo/models/network_utils.py", line 76, in forward
    return self.encoding(x, *args) if not self.include_xyz else torch.cat([x * self.xyz_scale + self.xyz_offset, self.encoding(x, *args)], dim=-1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 23.64 GiB total capacity; 19.97 GiB already allocated; 970.44 MiB free; 20.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: : 0it [00:07, ?it/s]

start time: 2023-11-03 08:46:23
sfm time: 2023-11-03 08:46:23
model_converter finished: 2023-11-03 08:46:24
colmap2mvsnet finished: 2023-11-03 08:46:25
mvsnet_inference finished: 2023-11-03 08:49:06
mvsnet_fusion finished: 2023-11-03 08:49:33
angelo_recon finished: 2023-11-03 08:50:11

kukumallou · Nov 03 '23

Hi, I have decreased model.num_samples_per_ray from 1024 to 128, but I still run into VRAM OOM. I'm using a 2070 with 8 GB of VRAM; can I run this project by adjusting other parameters?

lyupei · Nov 08 '23

@kukumallou @lyupei Wondering if you have solved the problem. It seems that OOM happens if the VRAM is below 11 GB, e.g., on a 3080 with 10 GB of VRAM.

jianghr-shanghaitech · Oct 22 '24
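No confirmed fix appears in this thread. For small cards (8 to 10 GB), a few more overrides are commonly tried on top of model.num_samples_per_ray; note that the traceback above fails inside the background occupancy-grid update (occ_eval_fn_bg in models/neus.py), so the background-related settings in the config are also worth a look. The extra key names in the sketch below (model.num_samples_per_ray_bg and model.train_num_rays) are assumptions based on similar instant-nsr-pl style configs and are not confirmed for this repository, so check configs/neuralangelo-colmap_dense.yaml for the actual names before using them:

```bash
# Sketch only: verify each key exists in configs/neuralangelo-colmap_dense.yaml first.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128   # allocator hint from the OOM message

python launch.py --config configs/neuralangelo-colmap_dense.yaml --gpu 0 --train \
    dataset.root_dir=$INPUT_DIR \
    model.num_samples_per_ray=128 \
    model.num_samples_per_ray_bg=64 \
    model.train_num_rays=64
```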