Consistently getting CUDNN_STATUS_INTERNAL_ERROR during second_stage training after diff_epoch
I'm training the first_stage and second_stage on two 4090 GPUs. The following error occurs randomly during second_stage training, once epoch > diff_epoch:
Traceback (most recent call last):
File "/home/hounsu/voice/StyleTTS2/train_second_faster.py", line 842, in
Is anyone else experiencing this problem? Thanks in advance.
What batch size are you using? I'm currently doing the first-stage training on two H100s and getting NaN in the mel loss. Did you see that during your first-stage training?
Well, the 2nd stage does not support distributed training, and you seem to be using a custom train_second_faster.py script, so there isn't much we can do unless we can see the code as well.
> What batch size are you using? I'm currently doing the first-stage training on two H100s and getting NaN in the mel loss. Did you see that during your first-stage training?
Sorry for the late response. I didn't have any NaN losses during the first-stage training.
> Well, the 2nd stage does not support distributed training, and you seem to be using a custom train_second_faster.py script, so there isn't much we can do unless we can see the code as well.
Thanks for the comment. As far as I know, it doesn't support DDP, but it does support DP. I've been using DP for the second-stage training.
Actually, second-stage training only supports a single GPU with the original code. There are repos that add DP and DDP support, which is why I asked what's in your custom train_second_faster.py.
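For reference, DP support in those forks boils down to wrapping each sub-module in nn.DataParallel so the batch is split across GPUs. This is a minimal sketch only; the dict-of-modules layout mirrors build_model in the original repo, but treat the details as assumptions rather than the actual code from any fork:

```python
import torch
import torch.nn as nn

# Minimal sketch of DP wrapping for the second stage: each StyleTTS2
# sub-module is wrapped in nn.DataParallel so batches are split across GPUs.
# The dict-of-modules layout is assumed from build_model in the original repo.
def wrap_with_dp(model: dict, device_ids=(0, 1)) -> dict:
    primary = torch.device(f"cuda:{device_ids[0]}")
    for key, module in model.items():
        if isinstance(module, nn.Module):
            model[key] = nn.DataParallel(module.to(primary),
                                         device_ids=list(device_ids))
    return model

# usage (after build_model(...), before the training loop):
# model = wrap_with_dp(model)
```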