
Loss becomes NaN after some time of training

Open HenghuiB opened this issue 1 year ago • 8 comments

[Screenshot from 2023-11-18 11-32-03] The rendered image has a white background.

HenghuiB avatar Nov 18 '23 19:11 HenghuiB

Wow... I also found the same problem during optimization. Initially I thought it was an error on my training machine. Most cases happen on scenes with more background points, such as flame_salmon_1 and coffee_martini from the Neu3D dataset. I think it may be numerical overflow during training. Do you have any ideas? I hope we can solve it together if you have time :)
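As a debugging aid, here is a minimal sketch of a NaN/Inf guard for a generic PyTorch training loop (the loop structure and names such as `loss`, `optimizer`, and `iteration` are assumptions for illustration, not this repository's actual code):

```python
import torch

def guarded_step(loss, optimizer, iteration):
    # Skip the update when the loss is already non-finite, so one bad
    # iteration does not poison the Gaussian parameters, and log where
    # the blow-up starts.
    if not torch.isfinite(loss):
        print(f"[iter {iteration}] non-finite loss {loss.item()}, skipping step")
        optimizer.zero_grad(set_to_none=True)
        return

    loss.backward()

    # Also check gradients before applying them; an overflow often shows
    # up here an iteration or two before the loss itself turns NaN.
    for group in optimizer.param_groups:
        for p in group["params"]:
            if p.grad is not None and not torch.isfinite(p.grad).all():
                print(f"[iter {iteration}] non-finite gradient detected, skipping step")
                optimizer.zero_grad(set_to_none=True)
                return

    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```

Logging the first iteration at which this triggers can help narrow down whether the overflow starts in the loss itself or in the gradients of a particular parameter group.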

guanjunwu avatar Nov 21 '23 06:11 guanjunwu

I also encountered this problem when training on my own scene; the loss may become NaN after several iterations in the fine stage. Besides, there are also cases where "RuntimeError: numel: integer multiplication overflow" happens during fine-stage training. I am not sure if it is caused by a similar reason.

Arisilin avatar Nov 24 '23 02:11 Arisilin

I met the same problem on a COLMAP-format dataset.


The PSNR suddenly drops to an unexpected value (4.28), while the number of points in the point cloud also decreases.
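When the PSNR collapses like this, a quick check of which Gaussian tensors contain non-finite values can help locate the source. A debugging sketch, assuming 3DGS-style attribute names (`_xyz`, `_scaling`, `_rotation`, `_opacity`), which may differ in this repository:

```python
import torch

def report_nonfinite(gaussians, iteration):
    # Count NaN/Inf entries per Gaussian attribute. The attribute names below
    # are assumptions based on common 3DGS implementations, not necessarily
    # this repo's exact API.
    tensors = {
        "xyz": gaussians._xyz,
        "scaling": gaussians._scaling,
        "rotation": gaussians._rotation,
        "opacity": gaussians._opacity,
    }
    for name, t in tensors.items():
        bad = (~torch.isfinite(t)).sum().item()
        if bad:
            print(f"[iter {iteration}] {name}: {bad}/{t.numel()} non-finite values")
```

If the scaling or rotation tensors are the first to go non-finite, that would be consistent with the deformation branches being the trigger, as discussed below.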

leo-frank avatar Dec 18 '23 00:12 leo-frank

I guess the scene's bounding box may be too large, which causes the error during backpropagation through the Gaussian deformation field network.
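If a very large bounding box is indeed the cause, one common mitigation is to normalize point coordinates into a fixed range before they enter the deformation network. A minimal sketch under that assumption (how it would be wired into this repository's deformation field is not shown):

```python
import torch

def normalize_xyz(xyz, aabb_min, aabb_max):
    # Map positions from a (possibly huge) scene AABB into [-1, 1] so the
    # deformation network never sees extreme coordinate magnitudes that can
    # overflow activations or gradients. The AABB can be computed once from
    # the initial point cloud.
    center = (aabb_min + aabb_max) * 0.5
    half_extent = ((aabb_max - aabb_min) * 0.5).clamp(min=1e-6)
    return ((xyz - center) / half_extent).clamp(-1.0, 1.0)
```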

guanjunwu avatar Dec 19 '23 04:12 guanjunwu

I guess the scene's bounding box may be too large, which causes the error during backpropagation through the Gaussian deformation field network.

Is there any solution to this problem?

GotFusion avatar Feb 25 '24 12:02 GotFusion

In my tests, setting no_dr=True and no_ds=True (disabling the deformation of rotation and scaling) reduces how often the problem happens.
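For reference, an illustrative snippet of how such switches might be set in a training config (this assumes the config files expose them through a `ModelHiddenParams` dict; check the files under `arguments/` in the repository for the actual field names and structure):

```python
# Illustrative only: assumed config structure, not copied from the repository.
ModelHiddenParams = dict(
    no_dr=True,  # disable the rotation-deformation branch
    no_ds=True,  # disable the scaling-deformation branch
)
```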

guanjunwu avatar Feb 25 '24 16:02 guanjunwu

In my tests, setting no_dr=True and no_ds=True (disabling the deformation of rotation and scaling) reduces how often the problem happens.

However, it seems that performance might be significantly affected by this approach. Are there any other solutions?

zhaohaoyu376 avatar Mar 27 '24 12:03 zhaohaoyu376

Why do I always have to restart training because the loss becomes NaN? I can't even finish a single run.

zhaohaoyu376 avatar Mar 27 '24 13:03 zhaohaoyu376