[Bug] distribute --use_ddp=true times out with error "1/4 clients joined"

Open devops724 opened this issue 9 months ago • 0 comments

Describe the bug

python -m TTS.bin.train_tts --config_path finetune_config.json --restore_path /home/user/.local/share/tts/tts_models--fa--custom--glow-tts/model_file.pth --use_ddp=true --gpus="0,1,2,3"
Found 24005 files in /home/user/workspace/dataset/train-tts3/dataset
Using model: glow_tts
Setting up Audio Processor...
 | sample_rate: 22050
 | resample: False
 | num_mels: 80
 | log_func: np.log10
 | min_level_db: -100
 | frame_shift_ms: None
 | frame_length_ms: None
 | ref_level_db: 20
 | fft_size: 1024
 | power: 1.5
 | preemphasis: 0.0
 | griffin_lim_iters: 60
 | signal_norm: True
 | symmetric_norm: True
 | mel_fmin: 0
 | mel_fmax: None
 | pitch_fmin: 1.0
 | pitch_fmax: 640.0
 | spec_gain: 20.0
 | stft_pad_mode: reflect
 | max_norm: 4.0
 | clip_norm: True
 | do_trim_silence: True
 | trim_db: 45
 | do_sound_norm: False
 | do_amp_to_db_linear: True
 | do_amp_to_db_mel: True
 | do_rms_norm: False
 | db_level: None
 | stats_path: None
 | base: 10
 | hop_length: 256
 | win_length: 1024
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)

Training Environment:
 | > Backend: Torch
 | > Mixed precision: True
 | > Precision: fp16
 | > Current device: 0
 | > Num. of GPUs: 4
 | > Num. of CPUs: 48
 | > Num. of Torch Threads: 24
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
Start Tensorboard: tensorboard --logdir=glowtts_persian_finetune-March-07-2025_01+37AM-0000000
Using PyTorch DDP
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/workspace/dataset/coqui-ai-TTS/TTS/bin/train_tts.py", line 77, in <module>
    main()
  File "/home/user/workspace/dataset/coqui-ai-TTS/TTS/bin/train_tts.py", line 63, in main
    trainer = Trainer(
              ^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/trainer/trainer.py", line 310, in __init__
    init_distributed(
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/trainer/utils/distributed.py", line 65, in init_distributed
    dist.init_process_group(
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1714, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 226, in _tcp_rendezvous_handler
    store = _create_c10d_store(
            ^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 194, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/4 clients joined.

cat finetune_config.json
{
  "run_name": "glowtts_persian_finetune",
  "model": "glow_tts",
  "batch_size": 8,
  "eval_batch_size": 4,
  "num_loader_workers": 4,
  "num_eval_loader_workers": 4,
  "run_eval": true,
  "test_delay_epochs": 5,
  "epochs": 1000,
  "text_cleaner": "phoneme_cleaners",
  "use_phonemes": true,
  "phoneme_language": "fa",
  "phoneme_cache_path": "ph_cache",
  "enable_eos_bos_chars": false,
  "precompute_num_workers": 4,
  "print_step": 10,
  "print_eval": true,
  "mixed_precision": true,
  "output_path": "./",
  "lr": 0.0001,
  "characters": {
    "characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",
    "vocab_dict": null,
    "pad": "<PAD>",
    "eos": "<EOS>",
    "bos": "<BOS>",
    "blank": "<BLNK>",
    "characters": "\u02c8\u02cc\u02d0\u02d1pbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029faegiouwy\u026a\u028a\u0329\u00e6\u0251\u0254\u0259\u025a\u025b\u025d\u0268\u0303\u0289\u028c\u028d0123456789\"#$%+/=ABCDEFGHIJKLMNOPRSTUVWXYZ[]^_{}",
    "punctuations": "!(),-.:;? \u0320\u060c\u061b\u061f\u200c<>",
    "phonemes": "\u02c8\u02cc\u02d0\u02d1pbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029faegiouwy\u026a\u028a\u0329\u00e6\u0251\u0254\u0259\u025a\u025b\u025d\u0268\u0303\u0289\u028c\u028d0123456789\"#$%+/=ABCDEFGHIJKLMNOPRSTUVWXYZ[]^_{}",
    "is_unique": true,
    "is_sorted": true
  },
  "datasets": [
    {
      "formatter": "ljspeech",
      "path": "./dataset/",
      "meta_file_train": "tts_dataset.csv",
      "ignored_speakers": []
    }
  ],
  "test_sentences": [
    "\u0633\u0644\u0637\u0627\u0646 \u0645\u062d\u0645\u0648\u062f \u062f\u0631 \u0632\u0645\u0633\u062a\u0627\u0646\u06cc \u0633\u062e\u062a \u0628\u0647 \u0637\u0644\u062e\u06a9 \u06af\u0641\u062a \u06a9\u0647: \u0628\u0627 \u0627\u06cc\u0646 \u062c\u0627\u0645\u0647 \u06cc \u06cc\u06a9 \u0644\u0627 \u062f\u0631 \u0627\u06cc\u0646 \u0633\u0631\u0645\u0627 \u0686\u0647 \u0645\u06cc \u06a9\u0646\u06cc ",
    "\u0645\u0631\u062f\u06cc \u0646\u0632\u062f \u0628\u0642\u0627\u0644\u06cc \u0622\u0645\u062f \u0648 \u06af\u0641\u062a \u067e\u06cc\u0627\u0632 \u0647\u0645 \u062f\u0647 \u062a\u0627 \u062f\u0647\u0627\u0646 \u0628\u062f\u0627\u0646 \u062e\u0648 \u0634\u0628\u0648\u06cc \u0633\u0627\u0632\u0645.",
    "\u0627\u0632 \u0645\u0627\u0644 \u062e\u0648\u062f \u067e\u0627\u0631\u0647 \u0627\u06cc \u06af\u0648\u0634\u062a \u0628\u0633\u062a\u0627\u0646 \u0648 \u0632\u06cc\u0631\u0647 \u0628\u0627\u06cc\u06cc \u0645\u0639\u0637\u0651\u0631 \u0628\u0633\u0627\u0632",
    "\u06cc\u06a9 \u0628\u0627\u0631 \u0647\u0645 \u0627\u0632 \u062c\u0647\u0646\u0645 \u0628\u06af\u0648\u06cc\u06cc\u062f.",
    "\u06cc\u06a9\u06cc \u0627\u0633\u0628\u06cc \u0628\u0647 \u0639\u0627\u0631\u06cc\u062a \u062e\u0648\u0627\u0633\u062a"
  ]
}
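
A note on reading the DistStoreError in the log above: the message reports how many of the expected participants connected to the rendezvous store before the timeout, so "1/4 clients joined" means only one of the four expected processes joined. The following is a toy, standard-library-only sketch of that counting logic. It is not PyTorch's TCPStore implementation; the function name `rendezvous` and its parameters are hypothetical, and counting the master as the first participant is an assumption made for illustration.

```python
import socket
import threading
import time


def rendezvous(world_size, workers_started, timeout_s=0.5):
    """Toy rendezvous (hypothetical helper, not PyTorch code).

    The master listens on a local port and waits until `world_size`
    participants have joined, or raises TimeoutError once `timeout_s` elapses.
    `workers_started` is how many worker threads actually try to connect.
    """
    joined = 1  # assumption: the master itself counts as the first participant
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        server.listen(world_size)
        port = server.getsockname()[1]

        def worker():
            # A worker "joins" simply by opening a connection to the master.
            with socket.create_connection(("127.0.0.1", port), timeout=timeout_s):
                pass

        threads = [threading.Thread(target=worker) for _ in range(workers_started)]
        for t in threads:
            t.start()

        deadline = time.monotonic() + timeout_s
        try:
            while joined < world_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError(
                        f"Timed out after {timeout_s} seconds waiting for clients. "
                        f"{joined}/{world_size} clients joined."
                    )
                server.settimeout(remaining)
                try:
                    conn, _ = server.accept()
                except socket.timeout:
                    continue  # the deadline check at the top of the loop raises
                conn.close()
                joined += 1
        finally:
            for t in threads:
                t.join()
    return joined


if __name__ == "__main__":
    print(rendezvous(world_size=4, workers_started=3))  # all participants join
    try:
        rendezvous(world_size=4, workers_started=0, timeout_s=0.2)
    except TimeoutError as err:
        print(err)  # only the master joined
```

In this sketch, starting three workers for a world size of four lets the call return normally, while starting none produces a timeout message of the same shape as the one in this report.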

To Reproduce

python -m TTS.bin.train_tts --config_path finetune_config.json --restore_path /home/user/.local/share/tts/tts_models--fa--custom--glow-tts/model_file.pth --use_ddp=true --gpus="0,1,2,3"

Expected behavior

No response

Logs


Environment

pip freeze | grep TTS
-e git+https://github.com/idiap/coqui-ai-TTS.git@4c593c620854d9cd2e177382abf48082f7c9f2ae#egg=coqui_tts
pip freeze | grep torch
torch==2.6.0
torchaudio==2.6.0

Additional context

No response

Posted by devops724 on Mar 07 '25 07:03