Problem encountered while running GPT training
"C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\python.exe" GPT_SoVITS/s1_train.py --config_file "TEMP/tmp_s1.yaml"
Seed set to 1234
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
<All keys matched successfully>
ckpt_path: None
[rank: 0] Seed set to 1234
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Traceback (most recent call last):
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\GPT_SoVITS\s1_train.py", line 171, in <module>
main(args)
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\GPT_SoVITS\s1_train.py", line 147, in main
trainer.fit(model, data_module, ckpt_path=ckpt_path)
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\trainer\call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 102, in launch
return function(*args, **kwargs)
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 947, in _run
self.strategy.setup_environment()
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 148, in setup_environment
self.setup_distributed()
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 199, in setup_distributed
_init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\lib\site-packages\lightning_fabric\utilities\distributed.py", line 290, in _init_dist_connection
torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\lib\site-packages\torch\distributed\distributed_c10d.py", line 888, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\lib\site-packages\torch\distributed\rendezvous.py", line 245, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "C:\Users\rober\OneDrive\桌面\GPT-SoVITS\runtime\lib\site-packages\torch\distributed\rendezvous.py", line 176, in _create_c10d_store
return TCPStore(
RuntimeError: unmatched '}' in format string
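I found a temporary workaround.
Open GPT-SoVITS\runtime\lib\site-packages\torch\distributed\rendezvous.py
Find the line start_daemon = rank == 0 (around line 175).
Add the line hostname = "localhost" directly below it, at the same indentation, and that does it.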
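For anyone who would rather not hand-edit a file inside site-packages, the manual steps above could be scripted roughly as follows. This is only a sketch: the helper name add_hostname_override is made up for illustration, and it assumes the anchor line appears in rendezvous.py exactly as quoted in this thread. It inserts the override at the same indentation as the anchor line and does nothing if the override is already there.

```python
# Sketch of the manual workaround from this thread, not an official fix.
# ANCHOR and OVERRIDE are the two lines quoted in the comment above;
# add_hostname_override is a hypothetical helper name.
ANCHOR = "start_daemon = rank == 0"
OVERRIDE = 'hostname = "localhost"'


def add_hostname_override(source: str) -> str:
    """Return source with OVERRIDE inserted right after each ANCHOR line."""
    lines = source.split("\n")
    patched = []
    for i, line in enumerate(lines):
        patched.append(line)
        if line.strip() != ANCHOR:
            continue
        following = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if following == OVERRIDE:
            continue  # already patched, leave the file text unchanged
        # Reuse the anchor line's leading whitespace so the new line
        # stays inside the same block.
        indent = line[: len(line) - len(line.lstrip())]
        patched.append(indent + OVERRIDE)
    return "\n".join(patched)
```

To apply it, read rendezvous.py (the path given above) as text, pass the contents through add_hostname_override, and write the result back. Keep a backup copy of the original file first, since reinstalling or updating torch will overwrite the change anyway.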
https://github.com/RVC-Boss/GPT-SoVITS/commit/59f35adad85815df27e9c6b33d420f5ebfd8376b
In theory, this commit fixes the original poster's problem. If it still fails, try the method from the comment above: adding hostname = "localhost".