stable-dreamfusion
I'm using the image-to-3D function and I'm getting NaN issues.
Description
I am using an RTX 3060 with 12GB VRAM, CUDA 11.6, PyTorch 1.13.1, Ubuntu 20.04.
Steps to Reproduce
When I run the script: main.py -O --image ./data/mushroom_rgba.png --workspace mushroom --save_mesh --iters 5000
After epoch 17 I get "NaN or Inf found in input tensor" every time.
Is this normal, or does it indicate a problem?
Start Training mushroom Epoch 17/50, lr=0.050000 ...
loss=1.0718 (1.0801), lr=0.050000: : 100% 100/100 [00:16<00:00, 5.94it/s]
==> Finished Epoch 17/50.
++> Evaluate mushroom at epoch 17 ...
loss=0.0000 (0.0000): : 100% 5/5 [00:04<00:00, 1.03it/s]
++> Evaluate epoch 17 Finished.
==> Start Training mushroom Epoch 18/50, lr=0.050000 ...
loss=1.0503 (1.0854), lr=0.050000: : 18% 18/100 [00:02<00:12, 6.54it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 19% 19/100 [00:03<00:13, 6.22it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 20% 20/100 [00:03<00:11, 6.82it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 21% 21/100 [00:03<00:12, 6.29it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 22% 22/100 [00:03<00:11, 6.76it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 23% 23/100 [00:03<00:12, 6.08it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 24% 24/100 [00:03<00:11, 6.55it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 25% 25/100 [00:04<00:11, 6.26it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 26% 26/100 [00:04<00:10, 6.84it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 27% 27/100 [00:04<00:11, 6.15it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 28% 28/100 [00:04<00:10, 6.61it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 29% 29/100 [00:04<00:11, 6.25it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 30% 30/100 [00:04<00:10, 6.81it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 31% 31/100 [00:05<00:11, 6.11it/s] NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 32% 32/100 [00:05<00:10, 6.68it/s] NaN or Inf found in input tensor.
Expected Behavior
Is this normal behavior, or is it a problem?
Environment
Ubuntu 20.04, PyTorch 1.13, CUDA 11.6
Alright, and after that I just get a sphere mesh.
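For anyone trying to narrow this down: the "NaN or Inf found in input tensor" warning is emitted when a non-finite scalar reaches the logging path, so it signals that the loss has already gone bad. A minimal, hypothetical guard one could drop into a training loop to catch it earlier (illustrative names only, not the repo's actual code):

```python
import torch

def finite_or_skip(loss: torch.Tensor) -> bool:
    """Return True if the loss is safe to backprop; warn and skip otherwise.

    A minimal guard in the spirit of the "NaN or Inf found in input tensor"
    warning seen in the logs above. Purely illustrative.
    """
    if torch.isfinite(loss).all():
        return True
    print("Non-finite loss detected, skipping this optimization step.")
    return False

# Hypothetical usage inside a training step:
# loss = compute_loss(model, batch)       # placeholder loss computation
# if finite_or_skip(loss):
#     loss.backward()
#     optimizer.step()
# optimizer.zero_grad(set_to_none=True)
```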
I noticed that using -O2 also leads to NaNs on the latest commit; however, on commit b6243518553cf8e10bdbd077f6951f25d1ef9638 the NaNs do not occur and it works fine. Note also that on the latest commit the NaNs tend to reappear after several iterations.
@aradhyamathur Hi, could you give more details on the environment and command lines to reproduce?
I got the same issue. I'm using the latest commit and I get NaNs no matter how many times I retry. I found that the grid backbone causes this error; grid_tcnn does not.
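To confirm which module first produces non-finite values (e.g. the grid encoder vs. the MLP), one option is to attach forward hooks with plain PyTorch. A rough debugging sketch, not part of stable-dreamfusion, with a placeholder model name:

```python
import torch
import torch.nn as nn

def add_nan_hooks(model: nn.Module) -> None:
    """Attach forward hooks that report the first module emitting NaN/Inf.

    Debugging aid only: helps localize whether a specific backbone/encoder
    is the source of the non-finite values.
    """
    def make_hook(name: str):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"[NaN watch] non-finite output from '{name}' "
                      f"({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# add_nan_hooks(nerf_model)  # 'nerf_model' is a hypothetical model instance
```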
@ashawkey Sure, here's the command that leads to NaNs:
main.py --text a dslr photo of hamburger --workspace burger --iters 15000 -O2 --eval_interval 50
The NaNs start to appear after 8 epochs. The logs follow.
[INFO] Cmdline: main.py --text a dslr photo of hamburger --workspace burger --iters 15000 -O2 --eval_interval 50
[INFO] Trainer: df | 2023-05-11_16-15-17 | cuda | fp16 | burger
[INFO] #parameters: 18983
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
==> [2023-05-11_16-15-18] Start Training burger Epoch 1/150, lr=0.005000 ...
loss=1.0007 (1.0011), lr=0.004924: : 100% 100/100 [00:18<00:00, 5.36it/s]
==> [2023-05-11_16-15-37] Finished Epoch 1/150. CPU=5.6GB, GPU=16.1GB.
==> [2023-05-11_16-15-37] Start Training burger Epoch 2/150, lr=0.004924 ...
loss=1.0002 (1.0006), lr=0.004849: : 100% 100/100 [00:18<00:00, 5.51it/s]
==> [2023-05-11_16-15-55] Finished Epoch 2/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-15-55] Start Training burger Epoch 3/150, lr=0.004849 ...
loss=1.0003 (1.0003), lr=0.004775: : 100% 100/100 [00:18<00:00, 5.50it/s]
==> [2023-05-11_16-16-13] Finished Epoch 3/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-16-13] Start Training burger Epoch 4/150, lr=0.004775 ...
loss=1.0002 (1.0002), lr=0.004702: : 100% 100/100 [00:18<00:00, 5.51it/s]
==> [2023-05-11_16-16-32] Finished Epoch 4/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-16-32] Start Training burger Epoch 5/150, lr=0.004702 ...
loss=1.0000 (1.0002), lr=0.004631: : 100% 100/100 [00:18<00:00, 5.47it/s]
==> [2023-05-11_16-16-50] Finished Epoch 5/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-16-50] Start Training burger Epoch 6/150, lr=0.004631 ...
loss=1.0004 (1.0002), lr=0.004576: : 77% 77/100 [00:14<00:04, 5.49it/s]NaN or Inf found in input tensor.
loss=1.0000 (nan), lr=0.004560: : 100% 100/100 [00:18<00:00, 5.48it/s]
==> [2023-05-11_16-17-08] Finished Epoch 6/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-17-08] Start Training burger Epoch 7/150, lr=0.004560 ...
loss=1.0006 (1.0002), lr=0.004491: : 100% 100/100 [00:18<00:00, 5.47it/s]
==> [2023-05-11_16-17-26] Finished Epoch 7/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-17-26] Start Training burger Epoch 8/150, lr=0.004491 ...
loss=1.0002 (1.0002), lr=0.004460: : 45% 45/100 [00:08<00:10, 5.48it/s]NaN or Inf found in input tensor.
loss=1.0000 (nan), lr=0.004456: : 50% 50/100 [00:09<00:09, 5.49it/s]NaN or Inf found in input tensor.
loss=1.0005 (nan), lr=0.004453: : 55% 55/100 [00:09<00:08, 5.51it/s]NaN or Inf found in input tensor.
loss=1.0000 (nan), lr=0.004449: : 60% 60/100 [00:10<00:07, 5.52it/s]NaN or Inf found in input tensor.
loss=1.0000 (nan), lr=0.004448: : 62% 62/100 [00:11<00:06, 5.53it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.004447: : 63% 63/100 [00:11<00:06, 5.55it/s]NaN or Inf found in input tensor.
loss=1.0002 (nan), lr=0.004445: : 66% 66/100 [00:11<00:06, 5.54it/s]NaN or Inf found in input tensor.
loss=1.0003 (nan), lr=0.004442: : 71% 71/100 [00:12<00:05, 5.52it/s]NaN or Inf found in input tensor.
loss=1.0000 (nan), lr=0.004439: : 76% 76/100 [00:13<00:04, 5.51it/s]NaN or Inf found in input tensor.
The torch version is '2.0.1+cu118'. The same prompt works without NaNs on the commit I highlighted in my previous comment.
Also, sometimes even when there are no NaNs, the loss stops going down and the model generates blank/green images, whereas previous versions used to work with a similar prompt and yield substantial results within 50 epochs. Lowering the learning rate further prevents the NaNs, but it slows the whole process down and gives lower-quality results.
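Since the trainer header above shows fp16, it may be worth noting the usual mixed-precision mitigations besides lowering the learning rate: letting the loss scaler skip steps whose gradients overflow, and clipping gradients. A hedged sketch of the standard PyTorch AMP recipe (generic placeholders for the model, loss, and batch; not the trainer used in this repo):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # skips optimizer steps whose grads are inf/nan

def train_step(model, optimizer, loss_fn, batch, max_grad_norm=1.0):
    """One fp16 training step with the common NaN mitigations."""
    optimizer.zero_grad(set_to_none=True)
    with autocast():
        loss = loss_fn(model, batch)          # placeholder loss computation
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                # so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    scaler.step(optimizer)                    # silently skipped on overflow
    scaler.update()
    return loss.detach()
```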
-O2 is not well tested, could you try with -O?
Actually, I have been playing around with -O2 for the past few months and never encountered this issue, especially with the default prompts. I shall also check -O.
I have checked: -O seems to be working.
However, -O2 still gives NaNs. I think this issue and the issue here are interlinked.
@m0eak Can you check this once and see if using the configuration in this commit works: here. This is for -O2. Someone also suggested that using CUDA 11.7 also prevents the NaN issue.
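For comparing environments (e.g. against the CUDA 11.7 suggestion above), a quick way to print the exact torch/CUDA combination in use, relying only on standard PyTorch introspection calls:

```python
import torch

# Report the runtime versions so setups can be compared across reports.
print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu")
```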