GPT-SoVITS

Can training run on the CPU?

Open zackzheng1121 opened this issue 1 year ago • 43 comments

image I spent ages on this and had already finished labeling everything, only to get this disappointing notice. Can I train on the CPU?

zackzheng1121 avatar Jan 29 '24 13:01 zackzheng1121

Marking this thread. While I'm here: has anyone gotten the CPU changes working, and how fast are training and inference?

zhuangzhuangliu2345 avatar Jan 30 '24 03:01 zhuangzhuangliu2345

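Even with a 4070 I have to wait. Even if the CPU can be used, I expect the wait to be extremely long. I'd suggest upgrading your hardware instead.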

angenet avatar Jan 30 '24 03:01 angenet

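I don't recommend using the CPU. On a 12400 with 32 GB of RAM, running two processes with a batch size of 20 (which fills the memory), it takes 40 s/it.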

ISDHN avatar Jan 30 '24 05:01 ISDHN

What about using it only for inference? That shouldn't be slow, right? Have you tried it?

zhuangzhuangliu2345 avatar Jan 30 '24 06:01 zhuangzhuangliu2345

Synthesizing 6 s of audio took 25 s.
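As a rough worked figure, those two numbers give a real-time factor (processing time divided by audio duration) of about 4.2, so CPU synthesis here ran roughly four times slower than playback. The function name below is only illustrative.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures reported in this comment: 6 s of audio took 25 s on the CPU.
rtf = real_time_factor(25.0, 6.0)
print(f"real-time factor: {rtf:.2f}")
```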

ISDHN avatar Jan 30 '24 06:01 ISDHN

I'm curious how to train on the CPU.

zackzheng1121 avatar Jan 30 '24 07:01 zackzheng1121

1. In the three files under GPT-SoVITS\GPT_SoVITS\prepare_datasets, comment out the line os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("_CUDA_VISIBLE_DEVICES").
2. In "GPT-SoVITS\GPT_SoVITS\s2_train.py", comment out the line directly below """Assume Single Node Multi GPUs Training Only""".
3. In that same file, change every to("mps") to to("cpu").
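Step 1 can also be done mechanically. The following is a minimal sketch, not part of the repository: `comment_out_cuda_env` is a hypothetical helper that works on the text of a source file and prefixes any uncommented `os.environ["CUDA_VISIBLE_DEVICES"] = ...` assignment with `# `. Reading and writing the three files is left out, and the sample text is made up for the demonstration.

```python
import re

# Matches an uncommented assignment to os.environ["CUDA_VISIBLE_DEVICES"],
# capturing the leading indentation so it can be preserved.
_PATTERN = re.compile(r'^(\s*)(os\.environ\["CUDA_VISIBLE_DEVICES"\]\s*=.*)$')


def comment_out_cuda_env(source: str) -> str:
    """Return source with CUDA_VISIBLE_DEVICES assignments commented out."""
    out = []
    for line in source.splitlines():
        match = _PATTERN.match(line)
        out.append(f"{match.group(1)}# {match.group(2)}" if match else line)
    return "\n".join(out)


# Made-up sample standing in for the contents of one of the files.
sample = (
    "import os\n"
    'os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("_CUDA_VISIBLE_DEVICES")\n'
    "print('ready')"
)
patched = comment_out_cuda_env(sample)
print(patched)
```

Lines that are already commented out do not match the pattern, so running the helper twice leaves the text unchanged. Back up the original files before applying anything like this.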

ISDHN avatar Jan 30 '24 08:01 ISDHN

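Sorry, my previous comment left something out.
In s2_train, the line os.environ["CUDA_VISIBLE_DEVICES"] = hps.train.gpu_numbers.replace("-", ",") also needs to be commented out.
In main of s2_train, set n_gpus manually to specify how many training processes to start. In main of s1_train, change accelerator to cpu and devices to 1 in the trainer initialization. If a type-mismatch error appears when running GPT training, also change precision to 32.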
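The s1_train changes amount to overriding three keyword arguments of the Lightning trainer. A minimal sketch, assuming the trainer is built from keyword arguments: `accelerator`, `devices`, and `precision` are real `Trainer` parameters, while `cpu_trainer_overrides` and the starting values are hypothetical and only illustrate the idea.

```python
def cpu_trainer_overrides(trainer_kwargs: dict, force_fp32: bool = False) -> dict:
    """Return a copy of trainer_kwargs adjusted for single-process CPU training."""
    updated = dict(trainer_kwargs)
    updated["accelerator"] = "cpu"
    updated["devices"] = 1
    if force_fp32:
        # Only needed if GPT training fails with a type-mismatch error.
        updated["precision"] = 32
    return updated


# Hypothetical starting values; the real ones are in main() of s1_train.py.
original = {"accelerator": "gpu", "devices": -1, "precision": 16, "max_epochs": 10}
cpu_kwargs = cpu_trainer_overrides(original, force_fp32=True)
print(cpu_kwargs)
# These would then be passed on as Trainer(**cpu_kwargs).
```

The helper copies the dictionary instead of mutating it, so the original settings remain available for comparison.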

ISDHN avatar Jan 30 '24 10:01 ISDHN

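CPU training is feasible in theory; as @ISDHN said, it mainly comes down to switching the relevant parts of the code to the CPU. I haven't tested training, but inference appears to be much slower than on a GPU.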

Lion-Wu avatar Jan 30 '24 13:01 Lion-Wu

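Sorry, my previous comment left something out. In s2_train, the line os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("_CUDA_VISIBLE_DEVICES") also needs to be commented out. In main of s2_train, set n_gpus manually to specify how many training processes to start. In main of s1_train, change accelerator to cpu and devices to 1 in the trainer initialization.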

Hello, I applied the changes above to the [prepackaged build], but 1B-fine-tuning training did not produce any model files.

erhuzi001 avatar Jan 31 '24 03:01 erhuzi001

image

zackzheng1121 avatar Jan 31 '24 03:01 zackzheng1121

It looks like you opened the wrong folder. In GPT-SoVITS\GPT_SoVITS\prepare_datasets, the first GPT-SoVITS is the folder that contains webui.py.

ISDHN avatar Jan 31 '24 03:01 ISDHN

Oh? Found it.

zackzheng1121 avatar Jan 31 '24 03:01 zackzheng1121

image Still the same.

zackzheng1121 avatar Jan 31 '24 03:01 zackzheng1121

Right, it will still show that, but you can ignore it and carry on with the next steps.

ISDHN avatar Jan 31 '24 03:01 ISDHN

Thanks, it has started running. image Then it failed again. image

Console: image

zackzheng1121 avatar Jan 31 '24 03:01 zackzheng1121

Please post a screenshot of the webui.

ISDHN avatar Jan 31 '24 03:01 ISDHN

Screenshot of the webui:

image

zackzheng1121 avatar Jan 31 '24 03:01 zackzheng1121

Experts, I followed the steps above, but 1B-fine-tuning training produced no model files. What should I do? T-T

erhuzi001 avatar Jan 31 '24 03:01 erhuzi001

I'm no expert, you're asking the wrong person QAQ

zackzheng1121 avatar Jan 31 '24 03:01 zackzheng1121

Check the console (command-line) output.

ISDHN avatar Jan 31 '24 03:01 ISDHN

So what should I do?

zackzheng1121 avatar Jan 31 '24 03:01 zackzheng1121

The list file path you entered seems to contain a strange character (right before D:\).

ISDHN avatar Jan 31 '24 03:01 ISDHN

image

zackzheng1121 avatar Jan 31 '24 03:01 zackzheng1121

I can't see it in the screenshot, but the console message shows one extra character.

ISDHN avatar Jan 31 '24 03:01 ISDHN

After SoVITS training finished, the console showed only:

"D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\runtime\python.exe" GPT_SoVITS/s2_train.py --config "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\TEMP/tmp_s2.json"

After GPT training finished:

"D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\runtime\python.exe" GPT_SoVITS/s1_train.py --config_file "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\TEMP/tmp_s1.yaml"
Seed set to 1234
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
<All keys matched successfully>
ckpt_path: None
[rank: 0] Seed set to 1234
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [17729382180.china.huawei.com]:59168 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [17729382180.china.huawei.com]:59168 (system error: 10049 - 在其上下文中,该请求的地址无效。).
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

semantic_data_len: 0
phoneme_data_len: 3
Empty DataFrame
Columns: [item_name, semantic_audio]
Index: []
Traceback (most recent call last):
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\GPT_SoVITS\s1_train.py", line 170, in <module>
    main(args)
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\GPT_SoVITS\s1_train.py", line 146, in main
    trainer.fit(model, data_module, ckpt_path=ckpt_path)
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\runtime\lib\site-packages\pytorch_lightning\trainer\call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\runtime\lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\runtime\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 950, in _run
    call._call_setup_hook(self)  # allow user to setup lightning_module in accelerator environment
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\runtime\lib\site-packages\pytorch_lightning\trainer\call.py", line 92, in _call_setup_hook
    _call_lightning_datamodule_hook(trainer, "setup", stage=fn)
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\runtime\lib\site-packages\pytorch_lightning\trainer\call.py", line 179, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\GPT_SoVITS\AR\data\data_module.py", line 29, in setup
    self._train_dataset = Text2SemanticDataset(
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\GPT_SoVITS\AR\data\dataset.py", line 107, in __init__
    self.init_batch()
  File "D:\users\xxxx\Downloads\GPT-SoVITS-beta\GPT-SoVITS-beta0128\GPT_SoVITS\AR\data\dataset.py", line 187, in init_batch
    for _ in range(max(2, int(min_num / leng))):
ZeroDivisionError: division by zero
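The log points at the cause: semantic_data_len: 0 and the empty DataFrame indicate that no semantic data was loaded, so the divisor leng in range(max(2, int(min_num / leng))) is presumably zero. A minimal sketch of a guard around that expression follows; `repeat_factor` is a hypothetical function, and the value used for `min_num` is a placeholder, not the repository's actual setting.

```python
def repeat_factor(min_num: int, leng: int) -> int:
    """Compute how many times a small dataset would be repeated.

    Mirrors the expression from the traceback, but fails early with a
    readable message when no samples were loaded.
    """
    if leng == 0:
        raise ValueError(
            "no training samples were loaded; check the dataset "
            "preparation outputs before starting GPT training"
        )
    return max(2, int(min_num / leng))


# min_num=100 is a placeholder value chosen for illustration.
print(repeat_factor(100, 3))
try:
    repeat_factor(100, 0)
except ValueError as err:
    print(f"error: {err}")
```

The guard does not fix the underlying problem. It only makes clear that the dataset preparation step produced no usable semantic data, which is what needs to be rerun and checked.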

erhuzi001 avatar Jan 31 '24 03:01 erhuzi001

Let me take a look.

zackzheng1121 avatar Jan 31 '24 03:01 zackzheng1121

I suggest searching for \u202a yourself. This is not a problem with this codebase or with CPU training.
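For context, \u202a is the Unicode LEFT-TO-RIGHT EMBEDDING character, an invisible formatting mark that can come along when a path is copied and pasted. A minimal sketch of stripping such characters before using a path, where `strip_format_chars` and the example path are hypothetical:

```python
import unicodedata


def strip_format_chars(path: str) -> str:
    """Remove invisible Unicode format characters (category Cf) from a path."""
    return "".join(ch for ch in path if unicodedata.category(ch) != "Cf")


# Hypothetical path with a LEFT-TO-RIGHT EMBEDDING character in front of it.
raw = "\u202aD:\\data\\train.list"
clean = strip_format_chars(raw)
print(repr(raw))
print(repr(clean))
```

Printing with `repr` makes the hidden character visible as an escape sequence. Retyping the path by hand in the webui avoids the problem as well.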

ISDHN avatar Jan 31 '24 03:01 ISDHN

How did you modify s2_train.py?

  1. Commented out this line:
 """Assume Single Node Multi GPUs Training Only"""
    # assert torch.cuda.is_available() or torch.backends.mps.is_available(), "Only GPU training is allowed."
  2. Changed to("mps") to to("cpu").

  3. There was no exact match for os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("_CUDA_VISIBLE_DEVICES"), so I commented out this line instead: #os.environ["CUDA_VISIBLE_DEVICES"] = hps.train.gpu_numbers.replace("-", ",")

  4. "In main of s2_train, set n_gpus manually to specify how many training processes to start": I don't know how to change this; it was already n_gpus = 1.

erhuzi001 avatar Jan 31 '24 03:01 erhuzi001

How did you modify s2_train.py?

  1. Commented out this line:
 """Assume Single Node Multi GPUs Training Only"""
    # assert torch.cuda.is_available() or torch.backends.mps.is_available(), "Only GPU training is allowed."
  2. Changed to("mps") to to("cpu").
  3. There was no exact match for os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("_CUDA_VISIBLE_DEVICES"), so I commented out this line instead: #os.environ["CUDA_VISIBLE_DEVICES"] = hps.train.gpu_numbers.replace("-", ",")
  4. "In main of s2_train, set n_gpus manually to specify how many training processes to start": I don't know how to change this; it was already n_gpus = 1.
def main():
    """Assume Single Node Multi GPUs Training Only"""
    # assert torch.cuda.is_available() or torch.backends.mps.is_available(), "Only GPU training is allowed."
    # if torch.backends.mps.is_available():
    #     n_gpus = 1
    # else:
    #     n_gpus = torch.cuda.device_count()
    n_gpus = 1
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = str(randint(20000, 55555))
    mp.spawn(
        run,
        nprocs=n_gpus,
        args=(
            n_gpus,
            hps,
        ),
    )

Thanks for pointing that out, I wrote it wrong above 💦💦💦

ISDHN avatar Jan 31 '24 03:01 ISDHN