
Loss always becomes NaN during training, regardless of parameters

Open 177488ZL opened this issue 2 years ago • 7 comments

Training on our dataset, with img_size=256, batch_size=4 or img_size=128, batch_size=16, every training run ends with the loss collapsing to NaN.
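
For reference, that configuration corresponds roughly to the setup below (a minimal sketch; the dataset path is a placeholder, and the learning rate and step count are taken from the library README's example, not from this report):

    from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

    model = Unet(dim=64, dim_mults=(1, 2, 4, 8))

    diffusion = GaussianDiffusion(
        model,
        image_size=128,          # or 256 with batch size 4
        timesteps=1000,
    )

    trainer = Trainer(
        diffusion,
        'path/to/our/dataset',   # placeholder path
        train_batch_size=16,     # or 4 at image_size=256
        train_lr=8e-5,           # README example value
        train_num_steps=700000,  # README example value
    )
    trainer.train()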

177488ZL avatar Feb 28 '23 07:02 177488ZL

Same problem after 50000 training steps.

Harry-Stephanie avatar Mar 01 '23 02:03 Harry-Stephanie

Any idea how to debug this issue?

snippler avatar Mar 07 '23 09:03 snippler

Training on our dataset, with img_size=256, batch_size=4 or img_size=128, batch_size=16, every training run ends with the loss collapsing to NaN.

Hi, have you found a solution?

Echo-jyt avatar Mar 15 '23 12:03 Echo-jyt

Training on our dataset, with img_size=256, batch_size=4 or img_size=128, batch_size=16, every training run ends with the loss collapsing to NaN.

Hi, have you found a solution?

no.....lol

177488ZL avatar Mar 23 '23 09:03 177488ZL

No. I tried training with various parameters for two weeks, and in the end the loss was always NaN. My graphics card is a 3080 Ti.

177488ZL avatar Mar 24 '23 05:03 177488ZL

Did you double-check your data? (Make sure that none of your data contains NaN values.)
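
For example, a quick scan along these lines (a sketch; the folder path and the *.png glob are placeholders for your dataset):

    from pathlib import Path

    import torch
    from PIL import Image
    from torchvision.transforms.functional import to_tensor

    for path in sorted(Path('path/to/images').rglob('*.png')):
        try:
            img = to_tensor(Image.open(path).convert('RGB'))
        except OSError as e:
            # truncated or corrupt image files also poison training
            print(f'unreadable file: {path} ({e})')
            continue
        # non-finite pixels mainly occur with float formats (e.g. TIFF, .npy)
        if not torch.isfinite(img).all():
            print(f'non-finite pixel values in: {path}')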

Adrian744 avatar Jun 03 '23 19:06 Adrian744

I had the same issue. If you are training with amp = True, be sure to run the script with accelerate launch script.py. That fixed my problem.
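
Concretely, something like this (a sketch; the script name, data path, and model settings are placeholders, not a prescribed configuration):

    # train.py -- minimal script with mixed precision enabled
    from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

    model = Unet(dim=64, dim_mults=(1, 2, 4, 8))
    diffusion = GaussianDiffusion(model, image_size=128, timesteps=1000)

    trainer = Trainer(
        diffusion,
        'path/to/images',    # placeholder path
        train_batch_size=16,
        amp=True,            # mixed precision on
    )
    trainer.train()

Then start it with

    accelerate launch train.py

rather than plain python train.py, as suggested above.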

perkeje avatar May 23 '24 12:05 perkeje