[Bug] Training XTTSv2 with DDP leads to weird training lags

Open • NikitaKononov opened this issue 7 months ago • 1 comment

Describe the bug

Hello. Training XTTSv2 with DDP leads to weird training lags: training gets stuck with no errors. Hardware: 6x RTX A6000 and 512 GB RAM.

Here is the GPU load monitoring graph. Purple is gpu0, green is gpu1 (all the other GPUs behave like gpu1).

[image: GPU load graph]

With 2 or 4 GPUs the situation remains the same.

I think there may be some kind of error in the Trainer or in the XTTS scripts. Or maybe my dataset is the cause, since it is rather large: 2000 hours of a single language.

To Reproduce

python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1,2,3,4,5
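
Since the hang produces no errors, one way to see where the ranks stall is to relaunch the same command with distributed debug logging turned on. The following is only a minimal sketch, assuming the NCCL backend is in use and that the launcher passes its environment on to the worker processes. The helper names `build_debug_env` and `build_command` are hypothetical and not part of Trainer or TTS. `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are standard NCCL and PyTorch environment variables.

```python
import os
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with distributed debug logging enabled.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # NCCL prints communicator setup and transport information at INFO level.
    env["NCCL_DEBUG"] = "INFO"
    # PyTorch adds extra consistency checks and logging for DDP collectives.
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    return env


def build_command(gpus):
    """Build the launch command from the report for the given GPU ids.

    Hypothetical helper for illustration only.
    """
    gpu_list = ",".join(str(g) for g in gpus)
    return [
        sys.executable, "-m", "trainer.distribute",
        "--script", "recipes/ljspeech/xtts_v2/train_gpt_xtts.py",
        "--gpus", gpu_list,
    ]


# Possible usage (not executed here, since it starts a full training run):
#   import subprocess
#   subprocess.run(build_command(range(6)), env=build_debug_env(), check=True)
```

The resulting output could then be pasted into the Logs section to show which rank or collective the other processes are waiting on.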

Expected behavior

Training should not get stuck.

Logs

No response

Environment

tts version: latest

Additional context

No response

NikitaKononov • Jul 01 '24 19:07