CUDA out of memory upon initiating neuralangelo training
I am a NeRF, conda and python noob, so go easy on me, lol.
I am trying to use sdfstudio to train a custom dataset. The dataset consists of 90 .jpg images with resolution 2028x1520. I processed the images through colmap successfully (from within sdfstudio), but I get constant "CUDA out of memory" errors when I attempt to train the model. I tried training the model directly using the following command (note that the command includes changes to several parameters, i.e. rays per batch and rays per chunk, that I hoped might allow the process to complete):
ns-train neuralangelo --pipeline.datamanager.train-num-rays-per-batch=32 --pipeline.datamanager.eval-num-rays-per-batch=32 --pipeline.model.sdf-field.inside-outside False --vis viewer --viewer.num-rays-per-chunk=512 --experiment-name neuralangelo-xfrmr-small nerfstudio-data --data C:\Users\fjsch\sdfstudio\outputs\xfrmr-small
Here is the error I received:
Setting up training dataset...
Caching all 81 images.
Setting up evaluation dataset...
Caching all 9 images.
No checkpoints to load, training from scratch
Printing profiling stats, from longest to shortest duration in seconds
VanillaPipeline.get_train_loss_dict: 0.7051
Traceback (most recent call last):
  File "C:\Users\fjsch\.conda\envs\sdfstudio\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\fjsch\.conda\envs\sdfstudio\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\fjsch\.conda\envs\sdfstudio\Scripts\ns-train.exe\__main__.py", line 7, in <module>
  File "C:\Windows\System32\sdfstudio\scripts\train.py", line 250, in entrypoint
    main(
  File "C:\Windows\System32\sdfstudio\scripts\train.py", line 236, in main
    launch(
  File "C:\Windows\System32\sdfstudio\scripts\train.py", line 175, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "C:\Windows\System32\sdfstudio\scripts\train.py", line 90, in train_loop
    trainer.train()
  File "C:\Windows\System32\sdfstudio\nerfstudio\engine\trainer.py", line 151, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "C:\Windows\System32\sdfstudio\nerfstudio\utils\profiler.py", line 43, in wrapper
    ret = func(*args, **kwargs)
  File "C:\Windows\System32\sdfstudio\nerfstudio\engine\trainer.py", line 321, in train_iteration
    self.grad_scaler.scale(loss).backward()  # type: ignore
  File "C:\Users\fjsch\.conda\envs\sdfstudio\lib\site-packages\functorch\_src\monkey_patching.py", line 77, in _backward
    return old_backward(*args, **kwargs)
  File "C:\Users\fjsch\.conda\envs\sdfstudio\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\fjsch\.conda\envs\sdfstudio\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 1.67 GiB (GPU 0; 8.00 GiB total capacity; 5.03 GiB already allocated; 366.00 MiB free; 5.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I have a 3060 Ti video card with 8 GB of VRAM and 32 GB of system memory. I am able to process and train models with far higher resolution images just fine using plain old nerfstudio; I just can't get sdfstudio to work. Any ideas? Any input you can provide would be greatly appreciated!
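Side note: the allocator hint at the end of that traceback refers to PyTorch's PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch of setting it from a Windows command prompt before rerunning the same command (the 128 MiB split size is purely illustrative, not something tested in this thread):
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
ns-train neuralangelo ... (same arguments as above)
This only mitigates fragmentation; it does not create more capacity on an 8 GB card.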
Try BakedSDF; Neuralangelo requires a lot of VRAM.
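For concreteness, switching methods is just a matter of changing the method name in the ns-train command. A sketch, assuming bakedsdf is the registered method name in this sdfstudio build and reusing the data path from above:
ns-train bakedsdf --vis viewer nerfstudio-data --data C:\Users\fjsch\sdfstudio\outputs\xfrmr-small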
Hi, you can try to use a smaller hash grid with --pipeline.model.sdf-field.hash-features-per-level 2 --pipeline.model.sdf-field.log2-hashmap-size 19. Or you could try bakedangelo and reduce the number of training rays.
How do I reduce the number of training rays?
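The ray count is controlled by the --pipeline.datamanager.train-num-rays-per-batch flag that already appears in the original command (set to 32 there); lowering it trades speed for memory. A sketch combining the suggestions so far, assuming bakedangelo accepts the same sdf-field options in this build, with 512 rays as an arbitrary illustrative value:
ns-train bakedangelo --pipeline.datamanager.train-num-rays-per-batch 512 --pipeline.model.sdf-field.hash-features-per-level 2 --pipeline.model.sdf-field.log2-hashmap-size 19 --vis viewer nerfstudio-data --data C:\Users\fjsch\sdfstudio\outputs\xfrmr-small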
But the smaller hash grid seems to slow down convergence? I haven't tried --pipeline.model.sdf-field.hash-features-per-level 8 --pipeline.model.sdf-field.log2-hashmap-size 22, which are the default settings, because even my 32 GB of VRAM is not enough. I am wondering why it consumes such a large amount of VRAM and whether the convergence is inherently slow.
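Rough back-of-envelope, assuming 16 hash levels and half-precision features (typical values, not confirmed defaults here): with --pipeline.model.sdf-field.log2-hashmap-size 22 and 8 features per level, each level can store up to about 2^22 x 8 x 2 bytes ≈ 64 MiB, so 16 levels come to roughly 1 GiB of hash parameters alone. Gradients plus Adam's two moment buffers, usually kept in fp32, multiply that several times over, and the per-sample activations from Neuralangelo's numerical-gradient SDF evaluations come on top, which is why the default grid needs far more VRAM than the reduced settings above.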