Error when training GPT with the integrated package: RuntimeError: unmatched '}' in format string

Open win10ogod opened this issue 1 year ago • 9 comments

D:\GPT-SoVITS>runtime\python.exe webui.py
Running on local URL:  http://0.0.0.0:9874
"D:\GPT-SoVITS\runtime\python.exe" GPT_SoVITS/s1_train.py --config_file "TEMP/tmp_s1.yaml"
Seed set to 1234
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
<All keys matched successfully>
ckpt_path: None
[rank: 0] Seed set to 1234
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Traceback (most recent call last):
  File "D:\GPT-SoVITS\GPT_SoVITS\s1_train.py", line 171, in <module>
    main(args)
  File "D:\GPT-SoVITS\GPT_SoVITS\s1_train.py", line 147, in main
    trainer.fit(model, data_module, ckpt_path=ckpt_path)
  File "D:\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "D:\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\trainer\call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "D:\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "D:\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "D:\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 947, in _run
    self.strategy.setup_environment()
  File "D:\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 148, in setup_environment
    self.setup_distributed()
  File "D:\GPT-SoVITS\runtime\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 199, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "D:\GPT-SoVITS\runtime\lib\site-packages\lightning_fabric\utilities\distributed.py", line 290, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "D:\GPT-SoVITS\runtime\lib\site-packages\torch\distributed\distributed_c10d.py", line 888, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "D:\GPT-SoVITS\runtime\lib\site-packages\torch\distributed\rendezvous.py", line 245, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "D:\GPT-SoVITS\runtime\lib\site-packages\torch\distributed\rendezvous.py", line 176, in _create_c10d_store
    return TCPStore(
RuntimeError: unmatched '}' in format string

win10ogod avatar Jan 18 '24 03:01 win10ogod

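Same here. After a long wait it reports a timeout. batch_size is 3 and everything else is at the defaults. I asked GPT and checked that the port is not in use. How can I fix this?

(image)

The file structure is as follows: (image)

The temp_s1.yaml file is as follows: (image)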

PLL-L avatar Jan 19 '24 07:01 PLL-L

Same problem. OS: Win11, CUDA 12.1

ioritree avatar Jan 23 '24 18:01 ioritree

I am facing the same issue as well.
OS: Win11
torch 2.1.2+cu118
torchaudio 2.0.1+cu118
torchmetrics 1.3.0.post0
torchvision 0.15.1+cu118

If anyone has insights or solutions, I would greatly appreciate the help. Thank you!

jx06T avatar Jan 25 '24 15:01 jx06T

Same problem. OS: Win11, CUDA 11.8

I tried all of the following combinations and the problem persists: Python 3.9 and Python 3.10, with PyTorch 2.0.1 and PyTorch 2.1.2.

light1943 avatar Jan 26 '24 09:01 light1943

I found a temporary workaround. Open GPT-SoVITS\runtime\lib\site-packages\torch\distributed\rendezvous.py, find the line start_daemon = rank == 0 (around line 175), and add the line hostname = "localhost" directly below it. That fixes it.
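For illustration, a minimal sketch of what the edited branch does after this workaround. It assumes the surrounding code matches the call visible in the tracebacks in this thread (`TCPStore(hostname, port, world_size, start_daemon, timeout, multi_tenant=True)`). `FakeStore` and `create_store_sketch` are hypothetical stand-ins so the snippet runs without torch; they are not torch APIs, and the port number is an arbitrary example.

```python
from dataclasses import dataclass


@dataclass
class FakeStore:
    """Hypothetical stand-in for torch's TCPStore; it only records its arguments."""
    hostname: str
    port: int
    world_size: int
    is_master: bool


def create_store_sketch(hostname, port, rank, world_size, store_cls=FakeStore):
    """Mirrors the patched branch of _create_c10d_store (assumed layout)."""
    start_daemon = rank == 0
    hostname = "localhost"  # the added line: the caller's address is ignored
    return store_cls(hostname, port, world_size, start_daemon)


# The caller passes an IP literal, but the store receives "localhost".
store = create_store_sketch("127.0.0.1", 29500, rank=0, world_size=1)
print(store.hostname)
```

Because the override is unconditional, it only makes sense for single-machine training; a multi-node setup would need the real master address.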

light1943 avatar Jan 30 '24 11:01 light1943

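light1943 wrote: "I found a temporary workaround. Open GPT-SoVITS\runtime\lib\site-packages\torch\distributed\rendezvous.py, find the line start_daemon = rank == 0 (around line 175), and add the line hostname = "localhost" directly below it. That fixes it."

Thanks, it works well.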

ioritree avatar Feb 01 '24 05:02 ioritree

Thank you very much! It works!

DruidTin avatar Feb 01 '24 07:02 DruidTin

light1943 wrote: "I found a temporary workaround. Open GPT-SoVITS\runtime\lib\site-packages\torch\distributed\rendezvous.py, find the line start_daemon = rank == 0 (around line 175), and add the line hostname = "localhost" directly below it. That fixes it."

  File "H:\SDAI\GPT-SoVITS\runtime\lib\site-packages\torch\distributed\rendezvous.py", line 177
    return TCPStore( hostname, port, world_size, start_daemon, timeout, multi_tenant=True)
IndentationError: unexpected indent

I added the line, and now a different error appears.

fgod999 avatar Feb 03 '24 15:02 fgod999

Could the indentation be missing? The added line must start at the same column as the line above it, start_daemon = rank == 0. (image)
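One way to avoid the indentation mistake is to insert the line programmatically and copy the whitespace from the target line. The sketch below works on a string; `add_hostname_override` is a hypothetical helper and `sample` is a made-up excerpt, not the actual contents of rendezvous.py.

```python
TARGET = "start_daemon = rank == 0"
ADDED = 'hostname = "localhost"'


def add_hostname_override(source: str) -> str:
    """Insert ADDED after the TARGET line, copying its leading whitespace."""
    if ADDED in source:
        return source  # already patched; do not add the line twice
    out = []
    for line in source.splitlines(keepends=True):
        out.append(line)
        if line.strip() == TARGET:
            indent = line[: len(line) - len(line.lstrip())]
            eol = "\r\n" if line.endswith("\r\n") else "\n"
            if not line.endswith("\n"):
                out[-1] = line + eol  # target was the last line without a newline
            out.append(indent + ADDED + eol)
    return "".join(out)


# Made-up excerpt for demonstration only.
sample = (
    "    else:\n"
    "        start_daemon = rank == 0\n"
    "        return TCPStore(hostname, port)\n"
)
patched = add_hostname_override(sample)
print(patched)
```

To apply it to a real file, read the file into a string, pass it through the helper, and write the result back after making a backup copy.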

light1943 avatar Feb 04 '24 07:02 light1943

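(To everyone hitting this problem, @win10ogod @light1943) Do ping 127.0.0.1 and ping localhost give you the same result? It looks like this happens because the address only works when written as localhost and not as 127.0.0.1?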
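The same comparison can be made from Python instead of ping. This is a small sketch using only the standard library; `resolve` is a hypothetical helper name. It shows what the OS resolver returns for each name and does not reproduce the torch error itself.

```python
import socket


def resolve(name: str):
    """Return the IPv4 address the OS resolver gives for name, or None on failure."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


for name in ("localhost", "127.0.0.1"):
    print(f"{name!r} -> {resolve(name)}")
```

If both names print the same address, name resolution itself is not the difference, and the problem lies in how the address string is handled further down.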

RVC-Boss avatar Feb 08 '24 13:02 RVC-Boss

https://github.com/RVC-Boss/GPT-SoVITS/commit/59f35adad85815df27e9c6b33d420f5ebfd8376b In theory this commit fixes the original poster's problem. If it still fails, try the method posted above of adding hostname = "localhost".

RVC-Boss avatar Feb 08 '24 13:02 RVC-Boss

"C:\GPT-SoVITS-beta\runtime\python.exe" GPT_SoVITS/s1_train.py --config_file "C:\GPT-SoVITS-beta\TEMP/tmp_s1.yaml"
Seed set to 1234
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
<All keys matched successfully>
ckpt_path: None
[rank: 0] Seed set to 1234
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Traceback (most recent call last):
  File "C:\GPT-SoVITS-beta\GPT_SoVITS\s1_train.py", line 170, in <module>
    main(args)
  File "C:\GPT-SoVITS-beta\GPT_SoVITS\s1_train.py", line 146, in main
    trainer.fit(model, data_module, ckpt_path=ckpt_path)
  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\pytorch_lightning\trainer\call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 947, in _run
    self.strategy.setup_environment()
  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 148, in setup_environment
    self.setup_distributed()
  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 199, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\lightning_fabric\utilities\distributed.py", line 290, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\torch\distributed\distributed_c10d.py", line 888, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\torch\distributed\rendezvous.py", line 245, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\torch\distributed\rendezvous.py", line 176, in _create_c10d_store
    return TCPStore(
RuntimeError: unmatched '}' in format string

How can I solve this? I already applied the fix the way it was described above, but I still get the error. Please help.

mohancheng avatar Feb 11 '24 15:02 mohancheng

Actually, that method did not solve it for me either, but the current latest version has fixed it. Go update to the latest version.

fgod999 avatar Feb 12 '24 02:02 fgod999

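RVC-Boss wrote: "(To everyone hitting this problem, @win10ogod @light1943) Do ping 127.0.0.1 and ping localhost give you the same result? It looks like this happens because the address only works when written as localhost and not as 127.0.0.1?"

Yes, localhost is 127.0.0.1, but here the address can only be written as localhost; writing the IP 127.0.0.1 fails. This is a strange torch bug that seems to appear only on Windows. Changing it to localhost is a fix that someone found elsewhere recently.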

The new version, 59f35ad, already fixes this problem, so modifying torch's rendezvous.py is no longer necessary. Thanks for the help!
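For setups where editing torch is undesirable, one alternative is to set the master address through the environment before training starts. The traceback shows the store being created by `_env_rendezvous_handler`, which takes its address from the `MASTER_ADDR` environment variable. This is only a sketch under that assumption and is not confirmed to be what commit 59f35ad does; `prefer_localhost_master` is a hypothetical helper name.

```python
import os


def prefer_localhost_master(env=None) -> str:
    """Use "localhost" as MASTER_ADDR when it is unset or set to 127.0.0.1."""
    env = os.environ if env is None else env
    if env.get("MASTER_ADDR", "127.0.0.1") == "127.0.0.1":
        env["MASTER_ADDR"] = "localhost"
    return env["MASTER_ADDR"]


# Call this before the trainer initializes torch.distributed.
print(prefer_localhost_master())
```

An explicitly configured address other than 127.0.0.1 is left untouched, so a multi-machine setup is not affected.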

light1943 avatar Feb 13 '24 12:02 light1943

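I keep updating to the latest version, but the problem is still there, with exactly the same error as the one I posted above. I really don't know what is going wrong.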

mohancheng avatar Feb 14 '24 16:02 mohancheng

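mohancheng wrote: "I keep updating to the latest version, but the problem is still there, with exactly the same error as the one I posted above. I really don't know what is going wrong."

If you had applied my fix, the error should be reported at line 177 rather than line 176, so check whether you edited the wrong file. If you edited the right file, have also updated to a version after 59f35ad, and it still fails, then you may have to look for a solution yourself. After all, nobody has really analyzed this strange torch problem; all we know is that in some environments hostname does not accept an IP address.

  File "C:\GPT-SoVITS-beta\runtime\lib\site-packages\torch\distributed\rendezvous.py", line 176, in _create_c10d_store
    return TCPStore(
RuntimeError: unmatched '}' in format string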
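To check whether the edited file is the one actually being imported, a sketch like the following can help. It should be run with the same interpreter that launches training (runtime\python.exe in the logs above). `find_rendezvous_file` and `has_override` are hypothetical helper names.

```python
import importlib.util
from pathlib import Path

MARKER = 'hostname = "localhost"'


def find_rendezvous_file():
    """Path of the rendezvous.py this interpreter would import, or None."""
    try:
        spec = importlib.util.find_spec("torch.distributed.rendezvous")
    except (ImportError, OSError):
        return None  # torch is missing or failed to load
    return Path(spec.origin) if spec and spec.origin else None


def has_override(path) -> bool:
    """True if the file contains the added line, ignoring indentation."""
    text = Path(path).read_text(encoding="utf-8")
    return any(line.strip() == MARKER for line in text.splitlines())


path = find_rendezvous_file()
if path is None:
    print("torch is not importable from this interpreter")
else:
    print(path, "patched" if has_override(path) else "not patched")
```

If the printed path differs from the file that was edited, the change went into a different Python installation.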

light1943 avatar Feb 15 '24 01:02 light1943