Loss becomes NaN when training from scratch

Open leonardodora opened this issue 10 months ago • 6 comments

Hi, when I train the t2v model from scratch, the loss becomes NaN. I know it is important to start from a pretrained model such as PixArt, but it is hard to explain why the loss becomes NaN when training from scratch. (image attached)

leonardodora avatar Apr 22 '24 02:04 leonardodora

We have not encountered this problem. One possible cause is half-precision training: you should use bf16 instead of fp16.
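
To illustrate why the choice of half-precision format can matter, here is a minimal sketch using only the Python standard library. It is not code from this repository: the helper names `to_fp16` and `to_bf16` and the sample value 70000.0 are assumptions made for the example, and bf16 is emulated by truncating a float32 rather than by rounding as real hardware typically does. fp16 has 5 exponent bits and tops out at 65504, while bf16 keeps the 8 exponent bits of float32, so a large intermediate value overflows to inf in fp16 but stays finite (with less precision) in bf16.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; values beyond the fp16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values, so model the overflow as inf.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Hypothetical large intermediate value (for example an activation or gradient).
activation = 70000.0

fp16_value = to_fp16(activation)     # overflows: above the fp16 maximum of 65504
bf16_value = to_bf16(activation)     # stays finite, but coarsely quantized
fp16_nan = fp16_value - fp16_value   # inf - inf yields nan

print(fp16_value, bf16_value, fp16_nan)
```

Once a single inf appears, an operation such as inf - inf turns it into NaN, which then propagates into the loss. This only shows one way fp16 can produce NaN; it does not explain a NaN that appears when bf16 is already in use.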

zhengzangw avatar Apr 25 '24 11:04 zhengzangw

Thanks, but bf16 is already set in the config file. Did you train without the PixArt weights and see a normal loss?

leonardodora avatar Apr 26 '24 02:04 leonardodora

Our computing resources are limited, so we have not tried training from scratch for a long time.

zhengzangw avatar Apr 26 '24 02:04 zhengzangw

Thanks for your reply! By the way, the newest update is awesome!

leonardodora avatar Apr 26 '24 02:04 leonardodora

@leonardodora I have the same problem. Have you solved it?

kawayi12318 avatar Apr 28 '24 09:04 kawayi12318

Not yet. Maybe PixArt needs to be retrained.

leonardodora avatar Apr 29 '24 08:04 leonardodora