stable-dreamfusion

Only black images being generated || loss=nan after epoch 10.

Open nazarPuriy opened this issue 2 years ago • 5 comments

Description

I used the model last week. Some generations worked; others produced only noise. Today every run ends up generating black images with loss=nan. As far as I know, none of the files have changed, and cloning the repository again did not help. Here is how the training starts:

image

After epoch 10 it looks like this: image

The validation folder looks like this: image

This command does not help either: python main.py --text "a hamburger" --workspace trial -O --albedo

The thing that I don't understand is why it suddenly stopped working. Any idea?

Sometimes the loss becomes nan a little after epoch 10, but the generated images are always black after epoch 10. Also, is it normal for the loss to be so low?

Steps to Reproduce

Prompts given in the README

Expected Behavior

No loss=nan

Environment

Ubuntu 22.04, torch.__version__: '1.13.1+cu117'

nazarPuriy, Feb 21 '23 20:02

@nazarPuriy Hi, this is strange. Maybe you could try full-precision mode (comment out the opt.fp16 = True line in main.py)?

ashawkey, Feb 23 '23 06:02
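As background for the full-precision suggestion, here is a minimal sketch of how half precision can turn ordinary numbers into inf, 0, and eventually nan. It is not code from stable-dreamfusion: `to_fp16` is a hypothetical helper that emulates fp16 rounding with the standard library's `struct` "e" format, and the sample values are chosen only for illustration. Whether this mechanism is what causes the loss=nan in this issue is not confirmed anywhere in the thread.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (fp16) value.

    Hypothetical helper for illustration only.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses out-of-range values; fp16 arithmetic would
        # saturate to +/-inf instead, so mimic that here.
        return math.copysign(math.inf, x)


big = to_fp16(70000.0)  # above the fp16 maximum of 65504, becomes inf
tiny = to_fp16(1e-8)    # below the smallest fp16 subnormal, becomes 0.0
diff = big - big        # inf - inf is nan, and nan then propagates

print(big, tiny, diff)
```

Once a single nan appears in a loss or gradient, every value computed from it is also nan, which would be consistent with the loss staying at nan after a certain epoch.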

I think the problem is the NVIDIA drivers. I switched from 470 to 525 and it stopped working. Now I am using 470 again and it works. Any idea why this is happening?

nazarPuriy, Feb 23 '23 09:02
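Since the driver version appears to matter here, it may help to attach it to reports alongside the torch version. The following is a rough sketch, not part of the repository: `collect_env` is a hypothetical helper name, and it assumes `nvidia-smi` is on the PATH when a driver is installed. Missing pieces are reported as None rather than raising.

```python
import shutil
import subprocess


def collect_env() -> dict:
    """Gather torch, CUDA, and NVIDIA driver versions (None if unavailable)."""
    info = {"torch": None, "cuda_build": None, "cuda_available": None, "driver": None}

    try:
        import torch

        info["torch"] = torch.__version__            # e.g. '1.13.1+cu117'
        info["cuda_build"] = torch.version.cuda      # CUDA version torch was built with
        info["cuda_available"] = torch.cuda.is_available()
    except (ImportError, OSError):
        pass  # torch is not installed or failed to load in this environment

    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10, check=True,
            )
            # One line per GPU; the driver version is the same for all of them.
            info["driver"] = result.stdout.strip().splitlines()[0]
        except (subprocess.SubprocessError, OSError, IndexError):
            pass  # nvidia-smi exists but could not report a driver version

    return info


if __name__ == "__main__":
    print(collect_env())
```

Running this once on a working setup and once on a failing one gives a direct comparison of the driver and CUDA build versions.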

@nazarPuriy Hi, this is strange, maybe you could try using full precision mode? (commenting the opt.fp16 = True line in main.py)

I ran into the same problem. Maybe it is a bug that needs fixing. Do you have any idea why it happens? Thanks!

StellarCheng, Mar 06 '23 09:03

Reverting to an older commit seems to work for now. I am facing this issue with torch '2.0.1+cu118'. I tried training with a lower lr, which seems to prevent the NaNs, but the results are not good.

aradhyamathur, May 11 '23 18:05

I checked, and in my case the issue occurs specifically with -O2, not with -O.

aradhyamathur, May 11 '23 18:05