gaussian-splatting
MemoryError: bad allocation: cudaErrorMemoryAllocation: out of memory
We are seeing this error:
Traceback (most recent call last):
  File "train.py", line 216, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 35, in training
    scene = Scene(dataset, gaussians)
  File "C:\gaussian-splatting\scene\__init__.py", line 83, in __init__
    self.gaussians.create_from_pcd(scene_info.point_cloud, self.cameras_extent)
  File "C:\gaussian-splatting\scene\gaussian_model.py", line 134, in create_from_pcd
    dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)
MemoryError: bad allocation: cudaErrorMemoryAllocation: out of memory
Our system is an RTX5000 with 16GB of VRAM; the test dataset is 20 images at 1288x1857. It's the first time we are testing on this RTX5000 machine.
We see a GPU memory spike and then this error appears, even with this small number of images.
Is there some kind of system test at the start of training? Is that what the spike is?
We have tested with a very small number of images, so it should at least attempt to train, correct?
The other thing to try is this suggestion, but we are unsure what values to change the settings to:
I don't have 24 GB of VRAM for training, what do I do? The VRAM consumption is determined by the number of points that are being optimized, which increases over time. If you only want to train to 7k iterations, you will need significantly less. To do the full training routine and avoid running out of memory, you can increase the --densify_grad_threshold, --densification_interval or reduce the value of --densify_until_iter. Note however that this will affect the quality of the result. Also try setting --test_iterations to -1 to avoid memory spikes during testing. If --densify_grad_threshold is very high, no densification should occur and training should complete if the scene itself loads successfully.
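For reference, a possible invocation with those flags relaxed, assuming the usual -s source-path argument of train.py (the values here are only illustrative, not recommended defaults):
python train.py -s <path_to_dataset> --densify_grad_threshold 0.0004 --densification_interval 200 --densify_until_iter 7000 --test_iterations -1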
Sorry, I changed my mind, this looks like it's clearly a problem with the kNN library at the start. Could you be so kind as to provide the data? Then I'm quite sure we can fix it.
I have a dataset, do you have an email I can send it to?
yes please, [email protected]
Shared a folder to that email :) Thanks for looking at it.
Hi,
I tried the dataset and I don't see a spike; it works for me (it doesn't produce a great-quality capture, but it works). So I assume something is off with the system setup... is the RTX5000 the only GPU in this machine?
Hi, yes it's the only GPU. We have tested on 3 machines of the same build with an RTX5000 and get the same issue. Other PCs of the same build with the same components but with an A6000 are fine.
I'm getting reconstruction of individual hairs on that dataset; the quality of the capture looks good here.
OK, interesting. We actually have an RTX5000; maybe we can try to reproduce the issue there.
Hello,
we are using a V100 32GB GPU on our cluster but are getting exactly the same error here,
even when training on the NeRF synthetic dataset; the error appears before training really starts.
Is it possible there is a bug in
dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)
Or do you have any hints?
Thanks.
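One way to narrow this down is to call distCUDA2 directly on a small synthetic point cloud; this is only a diagnostic sketch, assuming the simple-knn extension imports the same way it does in gaussian_model.py:
import torch
from simple_knn._C import distCUDA2  # same import used by gaussian_model.py
pts = torch.rand(100000, 3, device="cuda")  # small random point cloud, roughly 1 MB
print(distCUDA2(pts))  # should return per-point distances without an OOM
If even this tiny input fails with cudaErrorMemoryAllocation, the problem lies in the compiled simple-knn extension (or how it was built) rather than in the size of the real data.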
@PeterFWS we never solved the issue here
@henrypearce4D Hi,
the problem was solved on my side by manually installing the submodules:
TORCH_CUDA_ARCH_LIST="6.0 7.0 7.5 8.0 8.6+PTX" pip install submodules/diff-gaussian-rasterization/
TORCH_CUDA_ARCH_LIST="6.0 7.0 7.5 8.0 8.6+PTX" pip install submodules/simple-knn/
I think the problem was mainly a compatibility issue with the CUDA extension.
Best,
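A quick way to check whether this kind of build mismatch applies, assuming PyTorch is already installed in the environment, is to print the GPU's compute capability and compare it against the TORCH_CUDA_ARCH_LIST used when building the submodules:
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
For example, a Quadro RTX 5000 reports (7, 5) and a V100 reports (7, 0), both of which are covered by the "6.0 7.0 7.5 8.0 8.6+PTX" list above.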
@PeterFWS thank you, I will try tomorrow!
@PeterFWS
I tried:
pip uninstall diff-gaussian-rasterization
pip uninstall simple-knn
then:
set TORCH_CUDA_ARCH_LIST=6.0 7.0 7.5 8.0 8.6+PTX
pip install submodules/diff-gaussian-rasterization/
pip install submodules/simple-knn/
but still get the same error:
dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)
MemoryError: bad allocation: cudaErrorMemoryAllocation: out of memory
Any other suggestions?
@henrypearce4D Then I can only suggest using Docker... no other ideas...
I also encountered the same problem on the A100 40GB GPU. I tried the above method but it didn’t work. Is there any way to solve it?
I also encountered the same problem on an RTX3090 GPU: even though GPU memory usage is only about 2GB, it produces an OOM at this line.
TORCH_CUDA_ARCH_LIST="6.0 7.0 7.5 8.0 8.6+PTX" pip install submodules/diff-gaussian-rasterization/
Thanks, it works for me.
Try reinstalling diff-gaussian-rasterization this way.
Unfortunately the suggested fixes don't work for me; I also tried different CUDA versions. Are there any updates or other solutions that have come up?
It might help to redo the whole installation process in a clean environment like Docker and test whether it works. You could try my Dockerfile.
@altaykacan
Unfortunately the suggested fixes don't work for me; I also tried different CUDA versions. Are there any updates or other solutions that have come up?
I managed to fix my issue.
After uninstalling the modules, I also had to find and delete the cached built wheels for them. I noticed the reinstall ran far too quickly, and the log showed it installing from a cached location. So even if you set the CUDA arch list, pip will apparently use any cache that is available, and if that wheel was built before you specified the arch list, it will cause the issue again. Delete the cache before running the submodule install commands again; see the command sketch below.
Good luck!
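For anyone else hitting this, a possible command sequence based on the description above (the arch value is only an example, pick the one matching your GPU; note that pip cache purge clears all cached wheels, and --no-cache-dir is a gentler per-install alternative):
pip uninstall diff-gaussian-rasterization simple-knn
pip cache purge
set TORCH_CUDA_ARCH_LIST=7.5+PTX
pip install --no-cache-dir submodules/diff-gaussian-rasterization/ submodules/simple-knn/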