
torch.OutOfMemoryError: CUDA out of memory.

houchenfeng opened this issue 5 months ago · 5 comments

I trained on a custom dataset (DTU data with added distractors) using the default OnTheGo parameters, but I ran into an out-of-memory (OOM) error even after downsampling the images to 1/8. The GPU is an A800 with 80 GB of VRAM; nerfstudio 1.1.4, gsplat 1.0.0. The error message is in the log below. Could you provide some guidance?

train log: https://gist.github.com/houchenfeng/39925b81976a722e747c34aae3fbf9e6

houchenfeng · Jul 30 '25 08:07

It seems that your splat file has grown to a very large size: 21 million Gaussians. That is the main problem here. Consumer-grade GPUs like the 4090 can handle up to a few million Gaussians before running into OOM problems.

It looks like something in your custom data is forcing the Gaussians to grow very rapidly. Can you show what changes you have made to DTU? There is also something strange here: `Splitting 0.9999998574487111 gaussians: 21045054/21045057`, since it looks like the Gaussians are trying to grow even more, with almost all of them being split.
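For a sense of scale, here is a back-of-envelope estimate of what 21 million Gaussians cost in VRAM (an illustrative sketch, not DeSplat's actual accounting; the constants assume SH degree 3 and fp32, and it ignores the gsplat rasterization workspace, which scales with tile intersections and is often what actually OOMs):

```python
# Back-of-envelope VRAM estimate for N Gaussians (a sketch; ignores
# rendering buffers and any method-specific extra parameters).
def gaussian_param_bytes(n, sh_degree=3, bytes_per_float=4):
    # per Gaussian: mean (3) + quaternion (4) + scale (3) + opacity (1)
    # + SH coefficients: 3 * (sh_degree + 1) ** 2  -> 59 floats at degree 3
    per_gaussian = 3 + 4 + 3 + 1 + 3 * (sh_degree + 1) ** 2
    return n * per_gaussian * bytes_per_float

n = 21_045_057
param_bytes = gaussian_param_bytes(n)   # ~4.6 GiB of raw parameters
# gradients double it; Adam's two moment buffers double it again
total = param_bytes * 4
print(f"params {param_bytes / 2**30:.1f} GiB, with grads + Adam ~{total / 2**30:.1f} GiB")
```

Even before any rendering buffers, 21 M Gaussians under Adam already sit near 20 GiB, and a refinement step that splits nearly all of them transiently holds both the old and new sets, so the spike can plausibly exhaust even 80 GB.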

maturk · Jul 30 '25 09:07

> It seems that your splat file has grown to a very large size, 21 million Gaussians. […] There is also something strange here `Splitting 0.9999998574487111 gaussians: 21045054/21045057` […]

In the DTU dataset's Scan24 scene, I added randomly sized and colored patches to simulate visual distractors; only 30% of the images contain these interference elements, as illustrated below.

[images: Scan24 training views with random distractor patches]
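For concreteness, a minimal sketch of that kind of augmentation (assuming axis-aligned rectangles of uniform random color, written with NumPy; the actual patch generator used here may differ):

```python
import numpy as np

def add_random_patches(img, n_patches=5, max_frac=0.25, rng=None):
    """img: HxWx3 uint8 array; returns a copy with random colored rectangles.
    Applied to a random ~30% subset of the training views in this setup."""
    if rng is None:
        rng = np.random.default_rng()
    out = img.copy()
    h, w = img.shape[:2]
    for _ in range(n_patches):
        # random patch size, capped at a fraction of the image
        ph = int(rng.integers(8, max(9, int(h * max_frac))))
        pw = int(rng.integers(8, max(9, int(w * max_frac))))
        # random top-left corner so the patch stays inside the image
        y = int(rng.integers(0, h - ph))
        x = int(rng.integers(0, w - pw))
        out[y:y + ph, x:x + pw] = rng.integers(0, 256, size=3, dtype=np.uint8)
    return out
```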

houchenfeng · Jul 30 '25 11:07

> It seems that your splat file has grown to a very large size, 21 million Gaussians. […] There is also something strange here `Splitting 0.9999998574487111 gaussians: 21045054/21045057` […]

I set `split_screen_size: float = 0.04` (default = 0.05); otherwise the number of Gaussians rapidly drops to 0 (see the train log below). How should this parameter be set reasonably?

train log: https://gist.github.com/houchenfeng/86a6a56cb8238c88df1c44d12625af7b
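For reference, here is a condensed sketch of how `split_screen_size` enters splatfacto-style densification (paraphrased, not DeSplat's exact code; the other defaults shown are assumptions taken from splatfacto-style configs):

```python
import torch

def split_mask(grad_norm, scales, max_2Dsize, step,
               densify_grad_thresh=0.0002, densify_size_thresh=0.01,
               split_screen_size=0.05, stop_screen_size_at=4000):
    # gradient-driven splits: high view-space gradient AND large world-space size
    high_grad = grad_norm > densify_grad_thresh
    big_world = scales.max(dim=-1).values > densify_size_thresh
    splits = high_grad & big_world
    if step < stop_screen_size_at:
        # screen-size splits: max_2Dsize tracks each Gaussian's largest
        # projected radius as a fraction of the longer image side, so 0.04
        # means "split anything covering more than 4% of the image" -- this
        # is the branch that can flag nearly all Gaussians at once, as in
        # the log above
        splits |= max_2Dsize > split_screen_size
    return splits
```

Note that lowering the value makes screen-size splits *more* aggressive, so 0.04 should split more Gaussians, not fewer; if the count instead collapses to 0 at the 0.05 default, the culling side (e.g. the alpha or screen-size cull thresholds) seems worth checking rather than the split threshold.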

houchenfeng · Jul 30 '25 11:07

@houchenfeng can you try training with just the `splatfacto` method on your custom DTU data and see if the problem occurs with the baseline?

maturk · Jul 30 '25 15:07

> @houchenfeng can you try training with just the `splatfacto` method on your custom DTU data and see if the problem occurs with the baseline?

I just tried it, and everything works fine: training proceeds normally and takes about 40 minutes. I'll attach my training log below. I have already tested this dataset with 3DGS (original GitHub version), 2DGS, PGSR, GOF, SLS, RobustNeRF, NeRF On-the-go, and WildGaussians, so I don't think the way the dataset was created affects the reconstruction process. I suspect that some ranges or thresholds might not be appropriate.

train.log
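One quick way to chase that suspicion, if both model configs are Python dataclasses as in nerfstudio, is to diff them field by field (a hypothetical helper, not part of either codebase):

```python
from dataclasses import fields, is_dataclass

def diff_configs(a, b):
    """Print shared config fields whose values differ between two
    dataclass-based model configs (e.g. splatfacto vs. the desplat method)."""
    assert is_dataclass(a) and is_dataclass(b)
    shared = {f.name for f in fields(a)} & {f.name for f in fields(b)}
    for name in sorted(shared):
        va, vb = getattr(a, name), getattr(b, name)
        if va != vb:
            print(f"{name}: {va!r} vs {vb!r}")
```

Comparing the densification and culling thresholds between the working splatfacto run and the failing run would narrow down which setting drives the runaway splitting.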

houchenfeng · Jul 30 '25 16:07