
I'm using the image-to-3D function and I'm getting NaN issues.

Open m0eak opened this issue 1 year ago • 11 comments

Description

I am using an RTX 3060 with 12 GB VRAM, CUDA 11.6, PyTorch 1.13.1, Ubuntu 20.04.

Steps to Reproduce

When I run:

main.py -O --image ./data/mushroom_rgba.png --workspace mushroom --save_mesh --iters 5000

I get "NaN or Inf found in input tensor" every time after epoch 17. Is that a normal situation, or might there be a problem?

Start Training mushroom Epoch 17/50, lr=0.050000 ...
loss=1.0718 (1.0801), lr=0.050000: : 100% 100/100 [00:16<00:00, 5.94it/s]
==> Finished Epoch 17/50.
++> Evaluate mushroom at epoch 17 ...
loss=0.0000 (0.0000): : 100% 5/5 [00:04<00:00, 1.03it/s]
++> Evaluate epoch 17 Finished.
==> Start Training mushroom Epoch 18/50, lr=0.050000 ...
loss=1.0503 (1.0854), lr=0.050000: : 18% 18/100 [00:02<00:12, 6.54it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 19% 19/100 [00:03<00:13, 6.22it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 20% 20/100 [00:03<00:11, 6.82it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 21% 21/100 [00:03<00:12, 6.29it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 22% 22/100 [00:03<00:11, 6.76it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 23% 23/100 [00:03<00:12, 6.08it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 24% 24/100 [00:03<00:11, 6.55it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 25% 25/100 [00:04<00:11, 6.26it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 26% 26/100 [00:04<00:10, 6.84it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 27% 27/100 [00:04<00:11, 6.15it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 28% 28/100 [00:04<00:10, 6.61it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 29% 29/100 [00:04<00:11, 6.25it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 30% 30/100 [00:04<00:10, 6.81it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 31% 31/100 [00:05<00:11, 6.11it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: : 32% 32/100 [00:05<00:10, 6.68it/s]NaN or Inf found in input tensor.
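For anyone debugging this, here is a minimal sketch of one way to guard a mixed-precision training step against non-finite losses, so a single bad batch does not poison the optimizer state. The names model, optimizer, scaler, compute_loss, and data are placeholders standing in for the repo's actual trainer internals, which are not shown in this issue.

import torch

def train_step(model, optimizer, scaler, compute_loss, data):
    # `compute_loss` is a hypothetical helper for whatever produces the
    # scalar loss in the real trainer.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # mixed-precision context; drop if training in fp32
        loss = compute_loss(model, data)

    # Skip the update entirely if the loss is already NaN/Inf.
    if not torch.isfinite(loss):
        print("NaN or Inf loss, skipping this step")
        return None

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()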

Expected Behavior

Is this a normal situation, or might there be a problem?

Environment

Ubuntu 20.04, PyTorch 1.13, CUDA 11.6

m0eak avatar Apr 25 '23 09:04 m0eak

All right, after that I get a sphere mesh, lol.

m0eak avatar Apr 25 '23 09:04 m0eak

I noticed that using -O2 also leads to NaNs on the latest commit; however, on commit b6243518553cf8e10bdbd077f6951f25d1ef9638 the NaNs do not occur and it works fine. Also note that on the latest commit the NaNs tend to reappear after several iterations.

aradhyamathur avatar May 08 '23 17:05 aradhyamathur

@aradhyamathur Hi, could you give more details on the environment and command lines to reproduce?

ashawkey avatar May 09 '23 02:05 ashawkey

I got the same issue. I'm using the latest commit and I still get NaNs no matter how many times I retry. I found that the "grid" backbone causes this error; grid_tcnn does not.
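One way to confirm which module the non-finite values first come from (generic PyTorch debugging, not something stable-dreamfusion provides out of the box) is to register forward hooks that flag the first module whose output contains NaN/Inf:

import torch

def add_nan_hooks(model):
    # Report any module whose forward output contains NaN/Inf, e.g. to check
    # whether the grid encoder (vs. grid_tcnn) is where the bad values start.
    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output) and not torch.isfinite(output).all():
                print(f"non-finite output in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))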

dedoogong avatar May 10 '23 14:05 dedoogong

@ashawkey Sure, here's the command that leads to NaNs: main.py --text a dslr photo of hamburger --workspace burger --iters 15000 -O2 --eval_interval 50. The NaNs start to appear after 8 epochs. The logs follow.

[INFO] Cmdline: main.py --text a dslr photo of hamburger --workspace burger --iters 15000 -O2 --eval_interval 50
[INFO] Trainer: df | 2023-05-11_16-15-17 | cuda | fp16 | burger
[INFO] #parameters: 18983
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
==> [2023-05-11_16-15-18] Start Training burger Epoch 1/150, lr=0.005000 ...
loss=1.0007 (1.0011), lr=0.004924: : 100% 100/100 [00:18<00:00,  5.36it/s]
==> [2023-05-11_16-15-37] Finished Epoch 1/150. CPU=5.6GB, GPU=16.1GB.
==> [2023-05-11_16-15-37] Start Training burger Epoch 2/150, lr=0.004924 ...
loss=1.0002 (1.0006), lr=0.004849: : 100% 100/100 [00:18<00:00,  5.51it/s]
==> [2023-05-11_16-15-55] Finished Epoch 2/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-15-55] Start Training burger Epoch 3/150, lr=0.004849 ...
loss=1.0003 (1.0003), lr=0.004775: : 100% 100/100 [00:18<00:00,  5.50it/s]
==> [2023-05-11_16-16-13] Finished Epoch 3/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-16-13] Start Training burger Epoch 4/150, lr=0.004775 ...
loss=1.0002 (1.0002), lr=0.004702: : 100% 100/100 [00:18<00:00,  5.51it/s]
==> [2023-05-11_16-16-32] Finished Epoch 4/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-16-32] Start Training burger Epoch 5/150, lr=0.004702 ...
loss=1.0000 (1.0002), lr=0.004631: : 100% 100/100 [00:18<00:00,  5.47it/s]
==> [2023-05-11_16-16-50] Finished Epoch 5/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-16-50] Start Training burger Epoch 6/150, lr=0.004631 ...
loss=1.0004 (1.0002), lr=0.004576: :  77% 77/100 [00:14<00:04,  5.49it/s]NaN or Inf found in input tensor.
loss=1.0000 (nan), lr=0.004560: : 100% 100/100 [00:18<00:00,  5.48it/s]
==> [2023-05-11_16-17-08] Finished Epoch 6/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-17-08] Start Training burger Epoch 7/150, lr=0.004560 ...
loss=1.0006 (1.0002), lr=0.004491: : 100% 100/100 [00:18<00:00,  5.47it/s]
==> [2023-05-11_16-17-26] Finished Epoch 7/150. CPU=5.8GB, GPU=16.1GB.
==> [2023-05-11_16-17-26] Start Training burger Epoch 8/150, lr=0.004491 ...
loss=1.0002 (1.0002), lr=0.004460: :  45% 45/100 [00:08<00:10,  5.48it/s]NaN or Inf found in input tensor.
loss=1.0000 (nan), lr=0.004456: :  50% 50/100 [00:09<00:09,  5.49it/s]NaN or Inf found in input tensor.
loss=1.0005 (nan), lr=0.004453: :  55% 55/100 [00:09<00:08,  5.51it/s]NaN or Inf found in input tensor.
loss=1.0000 (nan), lr=0.004449: :  60% 60/100 [00:10<00:07,  5.52it/s]NaN or Inf found in input tensor.
loss=1.0000 (nan), lr=0.004448: :  62% 62/100 [00:11<00:06,  5.53it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.004447: :  63% 63/100 [00:11<00:06,  5.55it/s]NaN or Inf found in input tensor.
loss=1.0002 (nan), lr=0.004445: :  66% 66/100 [00:11<00:06,  5.54it/s]NaN or Inf found in input tensor.
loss=1.0003 (nan), lr=0.004442: :  71% 71/100 [00:12<00:05,  5.52it/s]NaN or Inf found in input tensor.
loss=1.0000 (nan), lr=0.004439: :  76% 76/100 [00:13<00:04,  5.51it/s]NaN or Inf found in input tensor.

The torch version is '2.0.1+cu118'. The same prompt works without NaNs on the commit I highlighted in my previous comment.
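To narrow down where the NaNs enter the backward pass, a generic PyTorch option (slow, so only worth enabling for a short reproduction run) is anomaly detection:

import torch

# Makes the backward pass raise an error at the first operation that produces
# NaN gradients, with a traceback pointing at the forward op that created them.
torch.autograd.set_detect_anomaly(True)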

aradhyamathur avatar May 11 '23 10:05 aradhyamathur

Also, sometimes even when there are no NaNs, the loss stops going down and the model generates blank/green images, whereas previous versions used to work with a similar prompt and yield substantial results within 50 epochs. Lowering the lr further prevents NaNs, but it slows the entire process down and gives lower-quality results.
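As a possible middle ground between the full learning rate and a heavily reduced one, clipping gradients before the optimizer step can sometimes keep a few bad batches from blowing up the weights. A sketch assuming the trainer uses a torch.cuda.amp.GradScaler; the variable names scaler, optimizer, and model are placeholders for the trainer's own objects:

import torch

# Unscale first so the clipping threshold applies to the true gradient values,
# then clip and step as usual.
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()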

aradhyamathur avatar May 11 '23 10:05 aradhyamathur

-O2 is not well tested; could you try with -O?

ashawkey avatar May 11 '23 14:05 ashawkey

Actually, I have been playing around with -O2 for the past few months and never encountered this issue, especially with the default prompts. I shall also check -O.

aradhyamathur avatar May 11 '23 16:05 aradhyamathur

I have checked; -O seems to be working.

aradhyamathur avatar May 11 '23 17:05 aradhyamathur

However, -O2 still gives NaNs. I think this issue and the issue here are interlinked.

aradhyamathur avatar May 11 '23 18:05 aradhyamathur

@m0eak Can you check this once and see if using the configuration in this commit (here) works? This is for -O2. Someone also suggested that using CUDA 11.7 prevents the NaN issue.

aradhyamathur avatar May 12 '23 20:05 aradhyamathur