GPU memory error occurs at epoch 50 during first-stage model training.
I'm training a model with 8xH100. However, I'm getting a GPU memory error at epoch 50. How can I fix this? @yl4579
@kadirnar did you find a solution for this?
I don't remember. You can see my latest attempts here:
https://github.com/Respaired/Tsukasa-Speech/issues/6
Epoch 50 of 1st stage training is where the TMA code kicks in, so you'll need to lower your batch size considerably at that point and keep the lower batch size until the end of stage 1 training. The same thing happens at epoch 20 in 2nd stage training: when you get an OOM there, roughly halve the batch size and continue from the last checkpoint.
As a side note, StyleTTS2 does not currently work on H100 hardware.
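For anyone who wants to plan the batch size change ahead of time instead of waiting for the OOM, here is a minimal sketch of the schedule described above. The helper `batch_size_for_epoch` is hypothetical and not part of StyleTTS2. The default reduction factor of 0.5 is an assumption: the advice is "considerably" lower for stage 1 and "about half" for stage 2, so tune it for your hardware. It also assumes epoch numbers match the ones quoted above (50 and 20).

```python
def batch_size_for_epoch(epoch: int,
                         base_batch_size: int,
                         threshold_epoch: int,
                         reduction: float = 0.5) -> int:
    """Pick the batch size to use when (re)starting training at `epoch`.

    Hypothetical helper: before `threshold_epoch` the full batch size is used;
    from `threshold_epoch` onward the batch size is scaled by `reduction`
    (an assumed factor) and never drops below 1.
    """
    if epoch < threshold_epoch:
        return base_batch_size
    return max(1, int(base_batch_size * reduction))


# Stage 1: extra memory is needed from epoch 50 onward.
stage1_before = batch_size_for_epoch(49, base_batch_size=16, threshold_epoch=50)
stage1_after = batch_size_for_epoch(50, base_batch_size=16, threshold_epoch=50)

# Stage 2: roughly halve the batch size from epoch 20 onward.
stage2_after = batch_size_for_epoch(20, base_batch_size=8, threshold_epoch=20)
```

In practice you would put the resulting value into your training config and resume from the last saved checkpoint, as suggested above.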