StyleTTS2

Consistently getting CUDNN_STATUS_INTERNAL_ERROR during second_stage training after diff_epoch

Open · hanshounsu opened this issue 9 months ago · 5 comments

I'm currently training the first_stage and second_stage on two 4090 GPUs. The following error occurs randomly during the second_stage, once epoch > diff_epoch:

Traceback (most recent call last):
  File "/home/hounsu/voice/StyleTTS2/train_second_faster.py", line 842, in <module>
    main()
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/home/hounsu/voice/StyleTTS2/train_second_faster.py", line 488, in main
    g_loss.backward()
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/torch/_tensor.py", line 581, in backward
    torch.autograd.backward(
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/home/hounsu/anaconda3/envs/styletts2/lib/python3.9/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Is anyone else experiencing this problem? Thanks in advance.

hanshounsu avatar Mar 08 '25 08:03 hanshounsu
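
For reference, a CUDNN_STATUS_INTERNAL_ERROR raised from backward() is usually investigated with a few standard PyTorch switches before digging into the model itself. A minimal sketch, assuming a plain PyTorch training script; none of this is taken from the custom train_second_faster.py:

import os
import torch

# Surface the actual failing kernel instead of an asynchronous cuDNN error.
# Must be set before the first CUDA call to take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Disable cuDNN autotuning, a common trigger for internal errors when input
# shapes vary between steps (at some cost in speed).
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Optional: fall back to non-cuDNN kernels entirely to confirm cuDNN is the culprit.
# torch.backends.cudnn.enabled = False

If the error disappears with cuDNN disabled or with a smaller batch size, that narrows it down to a cuDNN workspace/memory issue rather than a bug in the training code.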

What batch size are you using? I am currently doing first-stage training on two H100s and getting NaN in the mel loss. Have you seen that during your first-stage training?

iAdityaVishnu avatar Apr 20 '25 10:04 iAdityaVishnu
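
For the NaN question above, one standard way to locate where the loss first becomes non-finite is PyTorch's anomaly detection plus an explicit check after each loss term. A minimal sketch with placeholder names (check_loss and mel_loss are hypothetical, not identifiers from the repo):

import torch

# Make autograd raise as soon as a backward op produces NaN/Inf, with a traceback
# pointing at the forward op responsible (slow; for debugging only).
torch.autograd.set_detect_anomaly(True)

def check_loss(name: str, loss: torch.Tensor) -> None:
    # Hypothetical helper to call right after computing each loss term.
    if torch.isnan(loss).any() or torch.isinf(loss).any():
        raise RuntimeError(f"{name} became non-finite; check inputs, LR, or fp16 scaling")

# Example usage with a stand-in tensor; `mel_loss` here is only a placeholder name.
mel_loss = torch.tensor(float("nan"))
check_loss("mel loss", mel_loss)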

Well, the 2nd stage does not support distributed training, and you seem to be using a custom train_second_faster.py script, so there isn't much we can do unless we can see that code as well.

martinambrus avatar May 28 '25 07:05 martinambrus

What batch size are you using? I am currently doing first-stage training on two H100s and getting NaN in the mel loss. Have you seen that during your first-stage training?

Sorry for the late response. I didn't have any NaN losses during first-stage training.

hanshounsu avatar May 28 '25 07:05 hanshounsu

Well, the 2nd stage does not support distributed training, and you seem to be using a custom train_second_faster.py script, so there isn't much we can do unless we can see that code as well.

Thanks for the comment. As far as I know, it doesn't support DDP, but it does support DP; I've been using DP for second-stage training.

hanshounsu avatar May 28 '25 07:05 hanshounsu
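
"DP" here refers to torch.nn.DataParallel, which is the single-process form of multi-GPU training. A minimal sketch of that kind of wrapping, using a stand-in module rather than anything from train_second_faster.py:

import torch
import torch.nn as nn

# Stand-in module; in the real script this would be one of the StyleTTS2 components.
model = nn.Linear(80, 80)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the module across both GPUs and splits the batch
    # along dim 0 on each forward pass. It stays single-process, which is why it
    # can be bolted onto an otherwise single-GPU second-stage script, unlike DDP.
    model = nn.DataParallel(model)

batch = torch.randn(8, 80, device=device)  # placeholder batch
out = model(batch)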

Actually, second-stage training only supports a single GPU with the original code. There are repos that add code for DP and DDP, which is why I asked what's in your custom train_second_faster.py.

martinambrus avatar May 28 '25 13:05 martinambrus
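
To make the DP/DDP distinction concrete, a DDP variant runs one process per GPU and requires explicit process-group setup. The sketch below is generic PyTorch boilerplate, not code from any particular StyleTTS2 fork:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU, e.g. `torchrun --nproc_per_node=2 this_script.py`.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Stand-in module, not the actual StyleTTS2 second-stage model.
model = nn.Linear(80, 80).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# ... training loop would go here; gradients are all-reduced across ranks ...

dist.destroy_process_group()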