[Bug] Distributed training with --use_ddp=true times out with error "1/4 clients joined"

Open devops724 opened this issue 9 months ago • 0 comments

Describe the bug

Training Environment:
 | > Backend: Torch
 | > Mixed precision: True
 | > Precision: fp16
 | > Current device: 0
 | > Num. of GPUs: 4
 | > Num. of CPUs: 48
 | > Num. of Torch Threads: 24
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
Start Tensorboard: tensorboard --logdir=glowtts_persian_finetune-March-07-2025_01+37AM-0000000
Using PyTorch DDP
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/workspace/dataset/coqui-ai-TTS/TTS/bin/train_tts.py", line 77, in <module>
    main()
  File "/home/user/workspace/dataset/coqui-ai-TTS/TTS/bin/train_tts.py", line 63, in main
    trainer = Trainer(
              ^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/trainer/trainer.py", line 310, in __init__
    init_distributed(
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/trainer/utils/distributed.py", line 65, in init_distributed
    dist.init_process_group(
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1714, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 226, in _tcp_rendezvous_handler
    store = _create_c10d_store(
            ^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 194, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/4 clients joined.

cat finetune_config.json
{
  "run_name": "glowtts_persian_finetune",
  "model": "glow_tts",
  "batch_size": 8,
  "eval_batch_size": 4,
  "num_loader_workers": 4,
  "num_eval_loader_workers": 4,
  "run_eval": true,
  "test_delay_epochs": 5,
  "epochs": 1000,
  "text_cleaner": "phoneme_cleaners",
  "use_phonemes": true,
  "phoneme_language": "fa",
  "phoneme_cache_path": "ph_cache",
  "enable_eos_bos_chars": false,
  "precompute_num_workers": 4,
  "print_step": 10,
  "print_eval": true,
  "mixed_precision": true,
  "output_path": "./",
  "lr": 0.0001,
  "characters": {
    "characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",
    "vocab_dict": null,
    "pad": "<PAD>",
    "eos": "<EOS>",
    "bos": "<BOS>",
    "blank": "<BLNK>",
    "characters": "\u02c8\u02cc\u02d0\u02d1pbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029faegiouwy\u026a\u028a\u0329\u00e6\u0251\u0254\u0259\u025a\u025b\u025d\u0268\u0303\u0289\u028c\u028d0123456789\"#$%+/=ABCDEFGHIJKLMNOPRSTUVWXYZ[]^_{}",
    "punctuations": "!(),-.:;? \u0320\u060c\u061b\u061f\u200c<>",
    "phonemes": "\u02c8\u02cc\u02d0\u02d1pbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029faegiouwy\u026a\u028a\u0329\u00e6\u0251\u0254\u0259\u025a\u025b\u025d\u0268\u0303\u0289\u028c\u028d0123456789\"#$%+/=ABCDEFGHIJKLMNOPRSTUVWXYZ[]^_{}",
    "is_unique": true,
    "is_sorted": true
  },
  "datasets": [
    {
      "formatter": "ljspeech",
      "path": "./dataset/",
      "meta_file_train": "tts_dataset.csv",
      "ignored_speakers": []
    }
  ],
  "test_sentences": [
    "\u0633\u0644\u0637\u0627\u0646 \u0645\u062d\u0645\u0648\u062f \u062f\u0631 \u0632\u0645\u0633\u062a\u0627\u0646\u06cc \u0633\u062e\u062a \u0628\u0647 \u0637\u0644\u062e\u06a9 \u06af\u0641\u062a \u06a9\u0647: \u0628\u0627 \u0627\u06cc\u0646 \u062c\u0627\u0645\u0647 \u06cc \u06cc\u06a9 \u0644\u0627 \u062f\u0631 \u0627\u06cc\u0646 \u0633\u0631\u0645\u0627 \u0686\u0647 \u0645\u06cc \u06a9\u0646\u06cc ",
    "\u0645\u0631\u062f\u06cc \u0646\u0632\u062f \u0628\u0642\u0627\u0644\u06cc \u0622\u0645\u062f \u0648 \u06af\u0641\u062a \u067e\u06cc\u0627\u0632 \u0647\u0645 \u062f\u0647 \u062a\u0627 \u062f\u0647\u0627\u0646 \u0628\u062f\u0627\u0646 \u062e\u0648 \u0634\u0628\u0648\u06cc \u0633\u0627\u0632\u0645.",
    "\u0627\u0632 \u0645\u0627\u0644 \u062e\u0648\u062f \u067e\u0627\u0631\u0647 \u0627\u06cc \u06af\u0648\u0634\u062a \u0628\u0633\u062a\u0627\u0646 \u0648 \u0632\u06cc\u0631\u0647 \u0628\u0627\u06cc\u06cc \u0645\u0639\u0637\u0651\u0631 \u0628\u0633\u0627\u0632",
    "\u06cc\u06a9 \u0628\u0627\u0631 \u0647\u0645 \u0627\u0632 \u062c\u0647\u0646\u0645 \u0628\u06af\u0648\u06cc\u06cc\u062f.",
    "\u06cc\u06a9\u06cc \u0627\u0633\u0628\u06cc \u0628\u0647 \u0639\u0627\u0631\u06cc\u062a \u062e\u0648\u0627\u0633\u062a"
  ]
}
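
For readers unfamiliar with the error text: "1/4 clients joined" means the rendezvous store saw only one of the four expected processes before its timeout expired. The sketch below is a toy simulation using only the Python standard library. It is not PyTorch's TCPStore logic, and `ToyRendezvous` and `simulate` are hypothetical names made up for illustration. It only shows how such a message can arise when fewer processes start than the configured world size.

```python
import threading


class ToyRendezvous:
    """Toy stand-in for a rendezvous store (NOT PyTorch's TCPStore)."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        self._joined = 0
        self._cond = threading.Condition()

    def join(self) -> None:
        """Register one participant (a thread here, a process in real DDP)."""
        with self._cond:
            self._joined += 1
            self._cond.notify_all()

    def wait_for_all(self, timeout: float) -> None:
        """Block until world_size participants joined, or raise on timeout."""
        with self._cond:
            everyone_here = self._cond.wait_for(
                lambda: self._joined >= self.world_size, timeout
            )
            if not everyone_here:
                raise TimeoutError(
                    "Timed out waiting for clients. "
                    f"{self._joined}/{self.world_size} clients joined."
                )


def simulate(world_size: int, started: int, timeout: float = 0.05) -> str:
    """Start `started` participants and report whether the rendezvous completes."""
    store = ToyRendezvous(world_size)
    workers = [threading.Thread(target=store.join) for _ in range(started)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()  # every started participant has registered by now
    try:
        store.wait_for_all(timeout)
        return "ok"
    except TimeoutError as exc:
        return str(exc)


if __name__ == "__main__":
    print(simulate(world_size=4, started=1))
    print(simulate(world_size=4, started=4))
```

With one participant out of four the simulated wait times out, and with all four it returns immediately. That matches a run in which only a single training process reached `init_process_group`.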

To Reproduce

python -m TTS.bin.train_tts --config_path finetune_config.json --restore_path /home/user/.local/share/tts/tts_models--fa--custom--glow-tts/model_file.pth --use_ddp=true --gpus="0,1,2,3"
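
The command above starts a single Python process, while the log expects four participants. A multi-GPU launcher normally starts one process per GPU and gives each its own rank. The following is a minimal sketch of that bookkeeping, under stated assumptions: `build_rank_commands` is a hypothetical helper, and the `--rank` flag and the `RANK` environment variable are assumptions about what the training script reads, not something confirmed in this report. The sketch only builds the commands and does not launch anything.

```python
import sys
from typing import Dict, List, Tuple


def build_rank_commands(
    train_args: List[str], gpus: List[str]
) -> List[Tuple[Dict[str, str], List[str]]]:
    """Return one (env overrides, argv) pair per GPU, i.e. one per rank."""
    plans = []
    for rank, _gpu in enumerate(gpus):
        env = {
            # Assumption: every process sees all GPUs and selects one by rank.
            "CUDA_VISIBLE_DEVICES": ",".join(gpus),
            # Assumption: the script may read its rank from the environment.
            "RANK": str(rank),
        }
        argv = [
            sys.executable, "-m", "TTS.bin.train_tts",
            *train_args,
            "--use_ddp=true",
            f"--rank={rank}",  # assumed flag, see the note above
        ]
        plans.append((env, argv))
    return plans


plans = build_rank_commands(
    ["--config_path", "finetune_config.json"], ["0", "1", "2", "3"]
)
for env, argv in plans:
    # Print the plan instead of spawning processes.
    print(f"rank {env['RANK']}: {' '.join(argv[1:])}")
```

The Coqui TTS training documentation describes a launcher of this kind, invoked as `python -m trainer.distribute --script <path_to_your_script> ...`. Whether using it resolves the timeout reported here has not been verified.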

Expected behavior

No response

Logs

Environment

pip freeze | grep TTS
-e git+https://github.com/idiap/coqui-ai-TTS.git@4c593c620854d9cd2e177382abf48082f7c9f2ae#egg=coqui_tts
pip freeze | grep torch
torch==2.6.0
torchaudio==2.6.0

Additional context

No response

Mar 07 '25 07:03 devops724