
First step training very slow and high GPU memory

Open · zaidato opened this issue 9 months ago · 10 comments

Thank you for your great work. I am training the first stage on 3x 40GB A100 GPUs with max_len=300 frames. These GPUs are only just enough for batch_size=5, and training is very slow with both mix_precision=fp16 and fp32. Is this normal for StyleTTS2?

Another problem is that the loss becomes NaN after more than 10k steps:

```
INFO:2025-03-05 18:39:06,022: Epoch [1/10], Step [13340/394954], Mel Loss: 0.73000, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:18,793: Epoch [1/10], Step [13350/394954], Mel Loss: 0.71747, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:30,779: Epoch [1/10], Step [13360/394954], Mel Loss: 0.72086, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:43,441: Epoch [1/10], Step [13370/394954], Mel Loss: nan, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:55,656: Epoch [1/10], Step [13380/394954], Mel Loss: nan, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
```
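*(Editor's note: a Mel loss turning NaN a few thousand steps in is often a single bad batch or an exploding gradient. A minimal, hypothetical guard, not from the StyleTTS2 codebase, that skips non-finite losses and clips gradients could look like this:)*

```python
import torch

def safe_step(model, optimizer, loss, max_norm=1.0):
    """Skip the update when the loss is non-finite; clip gradients otherwise.

    Returns True if an optimizer step was actually taken.
    """
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)  # drop any stale gradients
        return False
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return True
```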

zaidato avatar Mar 07 '25 07:03 zaidato

Did you do this?

https://github.com/yl4579/StyleTTS2/pull/253

Yesterday I trained with 8x A100 GPUs using batch_size=16 and got a memory error. The max_len was high, though. Still, I think something's wrong.

kadirnar avatar Mar 07 '25 12:03 kadirnar

@kadirnar No, I didn't load a pretrained model. Do you mean your total batch_size is 16 across 8 GPUs, or batch_size=16 per GPU (total 16*8=128)?

zaidato avatar Mar 09 '25 11:03 zaidato

> @kadirnar No, I didn't load a pretrained model. Do you mean your total batch_size is 16 across 8 GPUs, or batch_size=16 per GPU (total 16*8=128)?

I set the batch size here to 16 and each GPU uses 70GB. It gives a memory error at epoch 50. Now I've set the batch size to 8, which uses 50GB of memory. However, a batch size of 8 on 8x H100s is really poor.

https://github.com/yl4579/StyleTTS2/blob/main/Configs/config.yml#L8

Dataset: https://huggingface.co/datasets/shb777/gemini-flash-2.0-speech
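*(Editor's note: when the per-GPU batch size is memory-bound, gradient accumulation is a common way to recover a larger effective batch without changing the model. A generic sketch, not part of the StyleTTS2 training script:)*

```python
import torch

def train_with_accumulation(model, optimizer, loader, accum_steps=4):
    """Simulate a larger effective batch by accumulating gradients.

    With per-GPU batch_size=8 and accum_steps=4, the effective batch is 32
    at roughly the memory cost of batch_size=8.
    """
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for i, (x, y) in enumerate(loader):
        loss = torch.nn.functional.mse_loss(model(x), y)
        (loss / accum_steps).backward()  # scale so the sum matches one big batch
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```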

kadirnar avatar Mar 09 '25 11:03 kadirnar

@zaidato I got the same error as you when I set the batch-size to 8 😆

kadirnar avatar Mar 09 '25 21:03 kadirnar

You got an error at epoch 50. I think it's because you set `TMA_epoch: 50` (the TMA starting epoch for the 1st stage); decreasing the batch size should fix that. In my case, on 40GB A100s, I set batch_size=3 and max_len=300. That is a very small batch size for an A100 GPU.
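*(Editor's note: for reference, the fields discussed in this thread sit in Configs/config.yml; the values below are the ones reported here, and the exact surrounding keys may differ:)*

```yaml
batch_size: 3   # per-GPU batch; 3 was the most that fit on a 40GB A100 here
max_len: 300    # max mel frames per sample; memory scales with this
TMA_epoch: 50   # TMA training starts at this epoch and raises memory use sharply
```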

zaidato avatar Mar 11 '25 02:03 zaidato

> You got an error at epoch 50. I think it's because you set `TMA_epoch: 50` (the TMA starting epoch for the 1st stage); decreasing the batch size should fix that. In my case, on 40GB A100s, I set batch_size=3 and max_len=300. That is a very small batch size for an A100 GPU.

I added FP16 support to the StyleTTS2 model and made a few adjustments. I trained with a batch size of 32. At epoch 86 the loss started returning NaN. Do you know anything about this issue? How many GPUs are you training on?

kadirnar avatar Mar 11 '25 10:03 kadirnar

I faced the same problem when using FP16.
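*(Editor's note: FP16 NaNs are frequently overflow in the scaled gradients. PyTorch's `GradScaler` is the standard mitigation; its `step()` skips the parameter update on overflow instead of letting inf/NaN poison the weights. A generic sketch, not the author's FP16 patch; the CPU/bfloat16 fallback is only there so the snippet runs without a GPU:)*

```python
import torch

def amp_step(model, optimizer, scaler, x, y):
    """One mixed-precision training step with overflow-safe updates."""
    device_type = "cuda" if x.is_cuda else "cpu"
    amp_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device_type, dtype=amp_dtype):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                       # expose true grad magnitudes
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)                           # skipped on inf/NaN grads
    scaler.update()
    return loss.item()
```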

zaidato avatar Mar 11 '25 15:03 zaidato

@zaidato I managed to train using this repo. There was only a bug with context_length, which I fixed by updating the mel_dataset. If training produces successful results, I will create a new repo and document it in detail. I just need to experiment with epoch values.

https://github.com/Respaired/Tsukasa-Speech

kadirnar avatar Mar 12 '25 21:03 kadirnar

@kadirnar What does context length mean? In your repo you set batch_size: 64 and max_len: 560. How can you increase these values without running out of memory?

zaidato avatar Mar 24 '25 04:03 zaidato

> @kadirnar What does context length mean? In your repo you set batch_size: 64 and max_len: 560. How can you increase these values without running out of memory?

The transcript text length should not exceed 512. I shared the experiments I conducted here:

https://github.com/Respaired/Tsukasa-Speech/issues/6
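*(Editor's note: one way to enforce such a limit is to drop over-long entries from the training list before training. The `wav_path|text|speaker_id` line format and the use of character count as a proxy for token count are assumptions for illustration, not code from either repo:)*

```python
def filter_by_context(lines, max_len=512):
    """Keep only train-list entries whose text fits the context window.

    Each line is assumed to follow a 'wav_path|text|speaker_id' format;
    len(text) is used here as a crude stand-in for the real token count.
    """
    kept = []
    for line in lines:
        parts = line.rstrip("\n").split("|")
        if len(parts) >= 2 and len(parts[1]) <= max_len:
            kept.append(line)  # preserve the original line verbatim
    return kept
```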

kadirnar avatar Mar 24 '25 11:03 kadirnar