
Training stops at the end when the model is being saved

Open K0pasz opened this issue 3 months ago • 2 comments

I use this Gaussian splatting tool in Google Colab because my PC does not have enough VRAM (6 GB); when I ran it locally it always stopped with an error indicating insufficient VRAM. The problem appears when I set the iteration count "too" high (e.g. 7000): the training process stops when it tries to save the model and the created splat. Furthermore, I see a "^C" in the output, so it looks like the command is being terminated somehow.

My colab notebook looks like this:

%cd /content
!git clone --recursive https://github.com/graphdeco-inria/gaussian-splatting
!pip install -q plyfile

%cd /content/gaussian-splatting
!pip install -q /content/gaussian-splatting/submodules/diff-gaussian-rasterization
!pip install -q /content/gaussian-splatting/submodules/simple-knn

from google.colab import drive
drive.mount('/content/drive')

!python train.py -s /content/drive/MyDrive/for_nerf_by_sai_cli/colmap -i /content/drive/MyDrive/for_nerf_by_sai_cli/images -m /content/output --iterations 10000
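
If a failing save is suspected, a quick check of disk headroom before the training cell can help rule out one cause. This is a minimal sketch; the `/content` path is Colab-specific, so it falls back to `/` elsewhere:

```python
import os
import shutil

# Hedged sketch: report free disk space where the model will be written.
# "/content" is Colab's local disk; fall back to "/" when run outside Colab.
path = "/content" if os.path.isdir("/content") else "/"
usage = shutil.disk_usage(path)
print(f"free disk at {path}: {usage.free / 1e9:.1f} GB")
```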

The output:

2024-04-30 10:03:24.108816: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-30 10:03:24.108868: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-30 10:03:24.116418: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-30 10:03:24.135188: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-30 10:03:25.983543: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Optimizing /content/output
Output folder: /content/output [30/04 10:03:29]
Reading camera 150/150 [30/04 10:03:29]
Loading Training Cameras [30/04 10:03:29]
Loading Test Cameras [30/04 10:03:33]
Number of points at initialisation :  34739 [30/04 10:03:33]
Training progress:  70% 7000/10000 [09:23<07:06,  7.04it/s, Loss=0.0648267]
[ITER 7000] Evaluating test: L1 0.08124514473112006 PSNR 19.291277433696546 [30/04 10:12:59]

[ITER 7000] Evaluating train: L1 0.044808738678693776 PSNR 22.959835433959963 [30/04 10:13:01]

[ITER 7000] Saving Gaussians [30/04 10:13:01]
^C

I tried to save the output into the connected environment's folder, but the issue remains. If I run 5000 or fewer iterations, the output is saved correctly.
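
For reference, a bare "^C" with no traceback in Colab output typically means the runtime killed the process after exhausting system RAM (not VRAM). A minimal sketch to check RAM headroom from the notebook, assuming a Linux runtime with `/proc/meminfo` (which Colab provides):

```python
# Hedged sketch: report available system RAM on Linux by parsing /proc/meminfo.
def available_ram_bytes(path="/proc/meminfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                # /proc/meminfo reports the value in kB
                return int(line.split()[1]) * 1024
    return None  # field not found (very old kernels)

avail = available_ram_bytes()
if avail is not None:
    print(f"available RAM: {avail / 1e9:.1f} GB")
```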

K0pasz avatar Apr 30 '24 11:04 K0pasz

Since it works at 5000 iterations, when the number of points, and thus the memory footprint, is smaller, I suspect it has to do with Google Colab setting a threshold on the size of files it can save. One check would be to disable densification and see whether it saves at 7000 as expected.
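
The file-size hypothesis can be sanity-checked with a back-of-the-envelope estimate. A sketch, assuming the saved `.ply` stores per point xyz, normals, DC and higher-order SH coefficients (SH degree 3), opacity, scale, and rotation as 32-bit floats, with header overhead ignored:

```python
def ply_size_bytes(n_points, sh_degree=3):
    """Rough size of a saved Gaussian-splat .ply (assumed layout, header ignored)."""
    f_rest = 3 * ((sh_degree + 1) ** 2 - 1)            # higher-order SH coefficients
    floats_per_point = 3 + 3 + 3 + f_rest + 1 + 3 + 4  # xyz, normals, f_dc, f_rest, opacity, scale, rot
    return n_points * floats_per_point * 4             # float32

print(ply_size_bytes(1_000_000) / 1e6, "MB")  # 248.0 MB per million points
```

Under these assumptions the file grows linearly with the point count, so a scene that densifies to a few million points reaches hundreds of MB, which is where a size threshold (or RAM spike during serialization) would start to bite.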

PanagiotisP avatar Apr 30 '24 17:04 PanagiotisP

Check this issue https://github.com/graphdeco-inria/gaussian-splatting/issues/235

GaneshBannur avatar May 03 '24 11:05 GaneshBannur