gaussian-splatting
MemoryError: bad allocation: cudaErrorMemoryAllocation: out of memory
We are seeing this error:
Traceback (most recent call last):
  File "train.py", line 216, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 35, in training
    scene = Scene(dataset, gaussians)
  File "C:\gaussian-splatting\scene\__init__.py", line 83, in __init__
    self.gaussians.create_from_pcd(scene_info.point_cloud, self.cameras_extent)
  File "C:\gaussian-splatting\scene\gaussian_model.py", line 134, in create_from_pcd
    dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)
MemoryError: bad allocation: cudaErrorMemoryAllocation: out of memory
Our system is an RTX5000 with 16GB of VRAM; the test dataset is 20 images at 1288x1857. It's the first time we are testing on this RTX5000 machine.
We see a GPU memory spike and then this error appears, even with this small number of images.
Is there some kind of system test at the start of training? Is that what the spike is?
We have tested with a very small number of images, so it should at least attempt to train, correct?
The other thing to try is this suggestion, but we are unsure what values to change the settings to:
I don't have 24 GB of VRAM for training, what do I do? The VRAM consumption is determined by the number of points that are being optimized, which increases over time. If you only want to train to 7k iterations, you will need significantly less. To do the full training routine and avoid running out of memory, you can increase the --densify_grad_threshold, --densification_interval or reduce the value of --densify_until_iter. Note however that this will affect the quality of the result. Also try setting --test_iterations to -1 to avoid memory spikes during testing. If --densify_grad_threshold is very high, no densification should occur and training should complete if the scene itself loads successfully.
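For reference, a possible invocation with those flags relaxed, assuming the usual -s source-path argument of train.py (the values here are only illustrative, not recommended defaults):
python train.py -s <path_to_dataset> --densify_grad_threshold 0.0004 --densification_interval 200 --densify_until_iter 7000 --test_iterations -1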
Sorry, I changed my mind, this looks like it's clearly a problem with the kNN library at the start. Could you be so kind as to provide the data? Then I'm quite sure we can fix it.
I have a dataset, do you have an email I can send it to?
yes please, [email protected]
Shared a folder to that email :) Thanks for looking at it.
Hi,
I tried the dataset and I don't see a spike; it works for me (it doesn't produce a great-quality capture, but it works). So I assume something is off with the system setup... is the RTX5000 the only GPU in this machine?
Hi, yes it's the only GPU. We have tested on 3 machines of the same build with an RTX5000 and get the same issue. Other PCs of the same build with the same components but with an A6000 are fine.
I'm getting reconstruction of individual hairs on that dataset; the quality of the capture looks good here.
OK, interesting. We actually have an RTX5000; maybe we can try to reproduce the issue there.
Hello,
we are using a V100 32GB GPU on our cluster but are getting exactly the same error here,
even when training on the NeRF synthetic dataset; the error appears before training really starts.
Is it possible there is a bug in
dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)
Or do you have any hints?
Thanks.
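One way to narrow this down is to call distCUDA2 directly on a small synthetic point cloud; this is only a diagnostic sketch, assuming the simple-knn extension imports the same way it does in gaussian_model.py:
import torch
from simple_knn._C import distCUDA2  # same import used by gaussian_model.py
pts = torch.rand(100000, 3, device="cuda")  # small random point cloud, roughly 1 MB
print(distCUDA2(pts))  # should return per-point distances without an OOM
If even this tiny input fails with cudaErrorMemoryAllocation, the problem lies in the compiled simple-knn extension (or how it was built) rather than in the size of the real data.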
@PeterFWS we never solved the issue here
@henrypearce4D Hi,
the problem was solved on my side by manually installing the submodules:
TORCH_CUDA_ARCH_LIST="6.0 7.0 7.5 8.0 8.6+PTX" pip install submodules/diff-gaussian-rasterization/
TORCH_CUDA_ARCH_LIST="6.0 7.0 7.5 8.0 8.6+PTX" pip install submodules/simple-knn/
I think the problem was mainly a compatibility issue with the CUDA extension.
Best,
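A quick way to check whether this kind of build mismatch applies, assuming PyTorch is already installed in the environment, is to print the GPU's compute capability and compare it against the TORCH_CUDA_ARCH_LIST used when building the submodules:
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
For example, a Quadro RTX 5000 reports (7, 5) and a V100 reports (7, 0), both of which are covered by the "6.0 7.0 7.5 8.0 8.6+PTX" list above.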
@PeterFWS thank you, I will try tomorrow!
@PeterFWS
I tried:
pip uninstall diff-gaussian-rasterization
pip uninstall simple-knn
then:
set TORCH_CUDA_ARCH_LIST=6.0 7.0 7.5 8.0 8.6+PTX
pip install submodules/diff-gaussian-rasterization/
pip install submodules/simple-knn/
but still get the same error:
dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)
MemoryError: bad allocation: cudaErrorMemoryAllocation: out of memory
Any other suggestions?
@henrypearce4D Then I can only suggest using Docker... no other ideas...
I also encountered the same problem on the A100 40GB GPU. I tried the above method but it didn’t work. Is there any way to solve it?
I also encountered the same problem on an RTX3090 GPU: even though GPU memory usage is only about 2GB, it produces an OOM at this line.
TORCH_CUDA_ARCH_LIST="6.0 7.0 7.5 8.0 8.6+PTX" pip install submodules/diff-gaussian-rasterization/
Thanks, it works for me.
Try reinstalling diff-gaussian-rasterization this way.
Unfortunately the suggested fixes don't work for me; I also tried different CUDA versions. Are there any updates or other solutions that have come up?
It might help to redo the whole installation process in a clean environment like Docker and test whether it works. You could try my Dockerfile.
@altaykacan
Unfortunately the suggested fixes don't work for me; I also tried different CUDA versions. Are there any updates or other solutions that have come up?
I managed to fix my issue.
After uninstalling the modules, I also had to find and delete the cached built wheels for them. I noticed the reinstall ran far too quickly, and the log showed it installing from a cached location. So even if you set the CUDA arch list, pip will apparently use any cache that is available, and if that wheel was built before you specified the arch list, it will cause the issue again. Delete the cache before running the submodule install commands again; see the command sketch below.
Good luck!
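For anyone else hitting this, a possible command sequence based on the description above (the arch value is only an example, pick the one matching your GPU; note that pip cache purge clears all cached wheels, and --no-cache-dir is a gentler per-install alternative):
pip uninstall diff-gaussian-rasterization simple-knn
pip cache purge
set TORCH_CUDA_ARCH_LIST=7.5+PTX
pip install --no-cache-dir submodules/diff-gaussian-rasterization/ submodules/simple-knn/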