Getting NaN in Mel Loss during the first few epochs for first training

Open iAdityaVishnu opened this issue 8 months ago • 6 comments

I am seeing NaN in the mel loss. The NaN appears within the first few epochs. Does anyone know how this can be solved?

This is on the master branch code, which I am training on two H100 GPUs.

@yl4579

Apr 17 '25 18:04 iAdityaVishnu

Apr 17 '25 20:04 iAdityaVishnu

You should change the loss values in the config file. https://github.com/Respaired/Tsukasa-Speech/issues/6#issuecomment-2758477322

Apr 18 '25 00:04 kadirnar

If I decrease the learning rate, then it would lead to an underfitted model.

Apr 20 '25 08:04 iAdityaVishnu

Here is the TensorBoard link:

http://151.115.73.7/

Apr 20 '25 09:04 iAdityaVishnu

@yl4579

Apr 20 '25 09:04 iAdityaVishnu

A bit late to the discussion here, but StyleTTS2 is currently incompatible with H100 hardware. Also, there are a few other issues here mentioning NaN during training. Have a look around.
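
While looking through those, it can help to know which loss term goes non-finite first. Below is a minimal sketch of such a check, not code from the StyleTTS2 repo: the helper name `nonfinite_terms` and the loss names and values in `step_losses` are made up for illustration. In a real training loop the values would come from something like `loss_mel.item()`.

```python
import math


def nonfinite_terms(loss_terms):
    """Return the names of loss terms whose value is NaN or +/-inf."""
    return [name for name, value in loss_terms.items()
            if not math.isfinite(value)]


# Hypothetical per-step values (assumption: plain Python floats,
# e.g. obtained via .item() on each loss tensor).
step_losses = {"mel": float("nan"), "gen": 0.73, "dur": 1.52}

bad = nonfinite_terms(step_losses)
if bad:
    # In a training loop, this is where one could log the step and
    # skip the optimizer update instead of propagating the NaN.
    print(f"non-finite loss terms: {bad}; skipping optimizer step")
```

Logging the first step at which this fires, along with the batch that triggered it, usually narrows things down to a data problem, a loss weight, or a precision issue.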

May 02 '25 17:05 martinambrus