vits2_pytorch

Training stuck

Open Madhavan0123 opened this issue 2 years ago • 18 comments

Hello,

Thanks for all the effort to create this repo. When I launch training, it runs for a few steps and then I see no progress at all. It's just stuck for a long time and still hasn't progressed.

INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/G_0.pth
INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/D_0.pth
INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/DUR_0.pth
Loading train data: 4%|████████████▍

Have you encountered this before? Any help would be greatly appreciated.

Madhavan0123 avatar Nov 14 '23 23:11 Madhavan0123

Hi, temporarily turn off the duration discriminator and tell me if it works.

p0p4k avatar Nov 15 '23 00:11 p0p4k
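(For reference, a minimal sketch of one way to turn that off via the config. The file path and the `use_duration_discriminator` key under `model` are assumptions, not taken from this thread, so check your own config for the exact names.)

```python
# Hedged sketch: write a copy of the config with the duration discriminator
# disabled. The config path and the "use_duration_discriminator" key are
# assumptions; adjust them to match the repo's actual config layout.
import json

with open("configs/vits2_ljs_base.json") as f:
    cfg = json.load(f)

cfg["model"]["use_duration_discriminator"] = False  # turn the duration discriminator off

with open("configs/vits2_ljs_base_no_durdisc.json", "w") as f:
    json.dump(cfg, f, indent=2)
```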

Yes, it seems to be working for now. Any idea why the duration discriminator is causing the issue?

Madhavan0123 avatar Nov 15 '23 04:11 Madhavan0123

I feel my implementation was too naive. It might need to be corrected with some testing. I'm busy with other models now; I'll do it when I have some time. Let me know about the audio quality after you train. Thanks.

p0p4k avatar Nov 15 '23 04:11 p0p4k

Hello, thank you for your great effort! I'm running into the same problem, and I'd like to know whether you plan to correct it soon or are still busy with other models.

CreepJoye avatar Nov 27 '23 08:11 CreepJoye

I have moved to improving pflowtts.

p0p4k avatar Dec 12 '23 07:12 p0p4k

I have moved to improving pflowtts.

Hi, p0p4k, how is pflowtts coming along now? Is it a better choice than vits2? Can it support both normal TTS and zero-shot TTS?

JohnHerry avatar Jan 19 '24 01:01 JohnHerry

I think it's better than vits/vits2. The only downside is that it's not end-to-end (e2e).

p0p4k avatar Jan 19 '24 08:01 p0p4k

I think it's better than vits/vits2. The only downside is that it's not end-to-end (e2e).

ok, thank you.

JohnHerry avatar Jan 19 '24 08:01 JohnHerry

@p0p4k do you know what the bug is here?

codeghees avatar Mar 14 '24 01:03 codeghees

@codeghees which part? The training-stuck part?

p0p4k avatar Mar 14 '24 01:03 p0p4k

yep

codeghees avatar Mar 14 '24 01:03 codeghees

In the same boat.

codeghees avatar Mar 14 '24 01:03 codeghees

@codeghees I haven't looked into this personally because I don't have a GPU yet. Maybe you can try to debug it and send a PR; I can assist you. Thanks a lot!

p0p4k avatar Mar 14 '24 05:03 p0p4k

Yep, will do! Trying to debug this.

codeghees avatar Mar 14 '24 15:03 codeghees

@p0p4k the hang happens at the line scaler.scale(loss_gen_all).backward()

It seemed like GradScaler has issues with multi-GPU, so I removed it and replaced it with standard backprop. The issue persists. Looks like a multi-GPU issue.

codeghees avatar Mar 21 '24 01:03 codeghees
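(A minimal, self-contained sketch of the two paths compared above: GradScaler-based backward vs. plain backward. The model, optimizer, and flag names are placeholders, not this repo's training-loop variables.)

```python
# Hedged sketch of swapping GradScaler-based backward for plain backprop.
# All names here are placeholders standing in for the VITS-style training loop.
import torch

model = torch.nn.Linear(8, 1).cuda()
optim = torch.optim.AdamW(model.parameters(), lr=2e-4)
use_amp = True  # stands in for an fp16_run-style config flag
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(4, 8, device="cuda")
with torch.cuda.amp.autocast(enabled=use_amp):
    loss = model(x).pow(2).mean()  # placeholder for loss_gen_all

optim.zero_grad()
if use_amp:
    # Original path: scale the loss so fp16 gradients do not underflow,
    # then let scaler.step() unscale before the optimizer update.
    scaler.scale(loss).backward()
    scaler.step(optim)
    scaler.update()
else:
    # The plain-backprop replacement that was tried; since the hang persisted,
    # the scaler itself is probably not the culprit.
    loss.backward()
    optim.step()
```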

Does it work on a single GPU?

p0p4k avatar Mar 21 '24 01:03 p0p4k

Haven't tested yet. Trying a run with fp16 enabled.

codeghees avatar Mar 21 '24 01:03 codeghees

@p0p4k I have no issues with single-GPU training, but it gets stuck when I do multi-GPU training. Any success in resolving the issue?

farzanehnakhaee70 avatar May 01 '24 11:05 farzanehnakhaee70
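(Not a confirmed fix, but one common cause of multi-GPU hangs is DistributedDataParallel waiting on gradients for parameters that some ranks never used on a given step, which can happen when a branch such as the duration discriminator is skipped. A hedged sketch of the usual first thing to try, with placeholder names:)

```python
# Hedged sketch, not the repo's actual wrapping code: allow DDP to tolerate
# parameters that receive no gradient on some steps, at a small sync cost.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(model: torch.nn.Module, rank: int) -> DDP:
    # find_unused_parameters=True makes DDP detect and skip parameters that
    # did not take part in the forward pass, instead of waiting on them.
    return DDP(model.cuda(rank), device_ids=[rank], find_unused_parameters=True)
```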