sd-scripts

SD3 training: loss is NaN

Open chongxian opened this issue 1 year ago • 9 comments

[image] [image] When I use this command, the loss is NaN. How can I solve this problem? Thanks for your help. The dataset is small, just 290 images, but the loss is NaN. I tried setting mixed_precision=bf16 and t5xxl_dtype=bf16, but these settings don't work; the loss is still NaN.

chongxian avatar Jul 01 '24 08:07 chongxian

t5xxl_dtype=bf16

mliand avatar Jul 01 '24 09:07 mliand

> t5xxl_dtype=bf16

I tried this setting, but it doesn't work.

chongxian avatar Jul 01 '24 09:07 chongxian

Your loss is NaN right from the initial stage of training. This is probably caused by fp16 precision. Set mixed_precision=bf16, and do not declare t5xxl_dtype.

leonary avatar Jul 01 '24 14:07 leonary
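
As a rough illustration of why fp16 is a common suspect when the loss is NaN from the first step: fp16 cannot hold finite values above 65504, and once a value overflows to infinity, later operations such as inf - inf yield NaN. bf16 keeps the fp32 exponent range, so the same value stays finite at lower precision. The sketch below uses only the Python standard library and emulates bf16 by truncating fp32 bits. The helper names `fits_fp16` and `to_bf16` are made up for this example and are not part of sd-scripts, and nothing here confirms that overflow was the cause in this thread.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_fp16(x: float) -> bool:
    """Return True if x can be stored as a finite fp16 value."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
    except OverflowError:
        return False
    return math.isfinite(x)


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping only the top 16 bits of the fp32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    value = 70000.0  # hypothetical activation slightly above FP16_MAX
    print("fits in fp16:", fits_fp16(value))
    print("bf16 approximation:", to_bf16(value))
    print("inf - inf:", math.inf - math.inf)
```

The bf16 result is coarser than the input because only 8 significant bits survive, but it remains finite, which is the property that matters here.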

> Your loss is NaN right from the initial stage of training. This is probably caused by fp16 precision. Set mixed_precision=bf16, and do not declare t5xxl_dtype.

[image] It doesn't work; the loss is still NaN.

chongxian avatar Jul 02 '24 02:07 chongxian

I've solved the problem now, but it may be a bug in the training code.

chongxian avatar Jul 03 '24 08:07 chongxian

Please remove the *_sd3_te.npz files in the training directory when changing mixed_precision or t5xxl_dtype. The cache files will then be recreated on the next run.

kohya-ss avatar Jul 04 '24 13:07 kohya-ss
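
The cleanup described above can be scripted. This is a minimal sketch, not part of sd-scripts: the function name `remove_te_cache` and the example path are hypothetical, and searching subdirectories recursively is an assumption, since the comment only says the files are in the training directory. It defaults to a dry run so the matches can be reviewed before anything is deleted.

```python
from pathlib import Path


def remove_te_cache(train_dir: str, dry_run: bool = True) -> list[Path]:
    """Find (and optionally delete) *_sd3_te.npz files under train_dir.

    Returns the list of matching paths. Nothing is deleted when dry_run is True.
    """
    # Assumption: cache files may also sit in subfolders, so search recursively.
    matches = sorted(Path(train_dir).rglob("*_sd3_te.npz"))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Hypothetical dataset location; replace with your own training directory.
    for cache_file in remove_te_cache("path/to/train_data", dry_run=True):
        print("would remove:", cache_file)
```

After checking the printed list, calling the function again with `dry_run=False` deletes the files, and the next training run should rebuild them with the current precision settings.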

Same problem! [image]

[image] Is there anything wrong with my script?

order-a-lemonade avatar Oct 22 '24 02:10 order-a-lemonade

> Same problem! [image]
>
> [image] Is there anything wrong with my script?

Try using xformers instead of sdpa.

bananasss00 avatar Jan 20 '25 22:01 bananasss00

I also have the same issue. Has anyone been able to solve it?

vikas784 avatar Apr 18 '25 10:04 vikas784