First step training very slow and high GPU memory
Thank you for your great work. I am using 3x 40GB A100 GPUs to train the first stage with max_len=300 frames. These GPUs are only just enough for batch_size=5, and training is very slow with both mixed precision (fp16) and fp32. Is this normal for StyleTTS2?
Another problem is that the loss becomes NaN after more than 10k steps:
INFO:2025-03-05 18:39:06,022: Epoch [1/10], Step [13340/394954], Mel Loss: 0.73000, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:18,793: Epoch [1/10], Step [13350/394954], Mel Loss: 0.71747, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:30,779: Epoch [1/10], Step [13360/394954], Mel Loss: 0.72086, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:43,441: Epoch [1/10], Step [13370/394954], Mel Loss: nan, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
INFO:2025-03-05 18:39:55,656: Epoch [1/10], Step [13380/394954], Mel Loss: nan, Gen Loss: 0.00000, Disc Loss: 0.00000, Mono Loss: 0.00000, S2S Loss: 0.00000, SLM Loss: 0.00000
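A minimal NaN-guard sketch (placeholder names such as `safe_backward_step`, `mel_loss`, and `optimizer`, not the actual train_first.py loop) that skips a batch whose mel loss is already non-finite and clips gradients, which often helps with NaNs that only appear after many steps:

```python
import torch

# Minimal NaN guard; `model`, `optimizer`, and `mel_loss` are placeholders,
# not the names used in the repo's training script.
def safe_backward_step(mel_loss, model, optimizer, max_norm=5.0):
    if not torch.isfinite(mel_loss):
        # Skip this batch so one bad sample does not poison the weights.
        optimizer.zero_grad()
        return False
    mel_loss.backward()
    # Clip gradients; exploding gradients are a common cause of NaNs
    # that only show up after thousands of steps.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad()
    return True
```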
Did you do this?
https://github.com/yl4579/StyleTTS2/pull/253
Yesterday I ran it with 8x A100 GPUs using batch_size=16 and got an out-of-memory error. The max_len was high, though. Still, I think something is wrong.
@kadirnar No, I didn't load a pretrained model. Do you mean your total batch_size is 16 (across 8 GPUs), or that each GPU has batch_size=16 (total batch size = 16*8)?
I set the batch size here to 16 and each GPU uses 70GB; it gives a memory error at epoch 50. Now I've set the batch size to 8 and it uses 50GB of memory. However, a batch size of 8 on 8x H100s is really poor.
https://github.com/yl4579/StyleTTS2/blob/main/Configs/config.yml#L8
Dataset: https://huggingface.co/datasets/shb777/gemini-flash-2.0-speech
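One way to keep the effective batch size up while per-GPU memory stays small is gradient accumulation. A minimal sketch assuming a plain PyTorch loop; the model/loader signatures below are placeholders, not the repo's trainer:

```python
import torch

ACCUM_STEPS = 4  # effective batch = per_gpu_batch * ACCUM_STEPS * num_gpus

def train_one_epoch(model, loader, optimizer, loss_fn):
    optimizer.zero_grad()
    for i, (texts, input_lengths, mels) in enumerate(loader):
        # Forward/backward on a small micro-batch; the field layout here
        # is illustrative, not the repo's actual collate output.
        loss = loss_fn(model(texts, input_lengths), mels)
        # Scale so the accumulated gradient matches one large batch.
        (loss / ACCUM_STEPS).backward()
        if (i + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```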
@zaidato I got the same error as you when I set the batch-size to 8 😆
You got an error at epoch 50. I think it's because you set TMA_epoch: 50 # TMA starting epoch (1st stage), so memory usage jumps once TMA training kicks in at that epoch. You need to decrease the batch size to fix that. In my case, with 40GB A100s, I set batch_size=3 and max_len=300, which is a very small batch size for an A100 GPU.
I added FP16 support to the StyleTTS2 model and made a few adjustments, and trained it with a batch size of 32. At epoch 86, the loss started returning NaN. Do you know anything about this issue? How many GPUs are you training on?
I faced the same problem when using FP16.
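For fp16 in particular, dynamic loss scaling via torch.cuda.amp is the usual way to avoid underflow-driven NaNs. A generic sketch with placeholder model/loss names, not the FP16 patch discussed above:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def amp_step(model, batch, targets, optimizer, loss_fn):
    # Generic mixed-precision step; model/loss names are placeholders.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # forward pass in reduced precision
        loss = loss_fn(model(batch), targets)
    scaler.scale(loss).backward()         # scale loss to avoid fp16 gradient underflow
    scaler.unscale_(optimizer)            # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
    scaler.step(optimizer)                # the step is skipped if grads contain inf/nan
    scaler.update()                       # adjust the loss scale for the next step
    return loss.item()
```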
@zaidato I managed to train using this repo. There was only a bug with context_length, which I fixed by updating the mel_dataset. If training produces good results, I will create a new repo and explain everything in detail. I just need to experiment with the epoch values.
https://github.com/Respaired/Tsukasa-Speech
@kadirnar What does context length mean? In your repo you set batch_size: 64 and max_len: 560. How can you increase these values without running out of memory?
Each transcript's length should not exceed 512. I shared the experiments I conducted here:
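As a rough sketch of enforcing that limit up front (the "path|text|speaker" layout of the training list is an assumption; adjust to the actual list format):

```python
# Drop training-list entries whose transcript exceeds 512 characters.
MAX_TEXT_LEN = 512

def filter_train_list(in_path, out_path, max_len=MAX_TEXT_LEN):
    kept = dropped = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.rstrip("\n").split("|")
            # Keep only entries whose text field fits within the limit.
            if len(parts) >= 2 and len(parts[1]) <= max_len:
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept} entries, dropped {dropped} over {max_len} characters")
```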
https://github.com/Respaired/Tsukasa-Speech/issues/6