GPT-SoVITS

GPT training: model is not saved after training finishes

Open kli017 opened this issue 1 year ago • 7 comments

As the title says: no error was shown during training, but after all epochs had run, no files were saved under GPT_weights.

kli017 avatar Jan 22 '24 07:01 kli017

+1

dario-github avatar Jan 24 '24 08:01 dario-github

+1

liuzl avatar Jan 25 '24 10:01 liuzl

+1

xidong9995 avatar Jan 26 '24 07:01 xidong9995

Could the batch size be too large? What batch size did you set?

RVC-Boss avatar Jan 26 '24 12:01 RVC-Boss

It doesn't seem to be a batch size problem. It fails even with the batch size set to 1.

dsdf783 avatar Jan 27 '24 12:01 dsdf783

I'm getting a similar error:

Traceback (most recent call last):
  File "Q:\GPT-SoVITS-beta\GPT-SoVITS\runtime\lib\site-packages\torch\serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "Q:\GPT-SoVITS-beta\GPT-SoVITS\runtime\lib\site-packages\torch\serialization.py", line 668, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\caffe2\serialize\inline_container.cc:476] . PytorchStreamWriter failed writing file data/2085: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Q:\GPT-SoVITS-beta\GPT-SoVITS\runtime\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in _wrap
    fn(i, *args)
  File "Q:\GPT-SoVITS-beta\GPT-SoVITS\GPT_SoVITS\s2_train.py", line 236, in run
    train_and_evaluate(
  File "Q:\GPT-SoVITS-beta\GPT-SoVITS\GPT_SoVITS\s2_train.py", line 444, in train_and_evaluate
    utils.save_checkpoint(
  File "Q:\GPT-SoVITS-beta\GPT-SoVITS\GPT_SoVITS\utils.py", line 78, in save_checkpoint
    torch.save(
  File "Q:\GPT-SoVITS-beta\GPT-SoVITS\runtime\lib\site-packages\torch\serialization.py", line 442, in save
    return
  File "Q:\GPT-SoVITS-beta\GPT-SoVITS\runtime\lib\site-packages\torch\serialization.py", line 291, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\caffe2\serialize\inline_container.cc:337] . unexpected pos 203241280 vs 203241172

The model trained in the 1Ba-SoVITS training step cannot be saved.

tuhang avatar Jan 27 '24 14:01 tuhang
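
The first traceback above ends in `file write failed`, meaning the write to the checkpoint file itself did not complete. The thread does not say why. One possibility worth ruling out is that the target drive ran out of free space. Below is a minimal sketch, assuming that cause, which checks free space first and writes through a temporary file so a failed write never leaves a truncated checkpoint behind. `safe_save` and `min_free_bytes` are hypothetical names, and `write_fn` stands in for whatever actually serializes the model (for example a small wrapper around `torch.save`).

```python
import os
import shutil
import tempfile


def safe_save(write_fn, dest_path, min_free_bytes=0):
    """Write a file via write_fn(tmp_path), then move it into place.

    write_fn is any callable that writes the payload to the path it receives.
    Raises OSError up front if the destination drive has too little free space.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)

    # Refuse early instead of failing halfway through the write.
    free = shutil.disk_usage(dest_dir).free
    if free < min_free_bytes:
        raise OSError(f"only {free} bytes free in {dest_dir}, need {min_free_bytes}")

    # Write next to the destination so the final rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the partial temporary file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path
```

With PyTorch installed, a call could look like `safe_save(lambda p: torch.save(state, p), "ckpt.pth", min_free_bytes=2 * 1024**3)`, where `state` and the 2 GiB threshold are placeholders.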

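This happens because the if statement in the function on_train_epoch_end(self, trainer, pl_module) in s1_train.py is never entered, so the checkpoint cannot be saved. Changing it to if True: solves the problem, but that doesn't seem like a reasonable fix. How are self._should_save_on_train_epoch_end(trainer) and self._should_skip_saving_checkpoint(trainer) supposed to get their values here?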

dsdf783 avatar Jan 27 '24 15:01 dsdf783
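
To see why forcing `if True:` makes the checkpoint appear, it can help to look at the gate in isolation. The sketch below is not the code from `s1_train.py`. It is a stand-in that assumes the condition combines the two predicates named above as "not skip, and save on epoch end", followed by a hypothetical every-N-epochs check. `FakeCheckpointGate` and its fields are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class FakeCheckpointGate:
    """Stand-in for the checkpoint callback's gating logic (assumed, not copied)."""

    skip_saving: bool          # plays the role of _should_skip_saving_checkpoint(trainer)
    save_on_epoch_end: bool    # plays the role of _should_save_on_train_epoch_end(trainer)
    every_n_epochs: int = 1    # hypothetical save interval

    def explain(self, current_epoch: int) -> str:
        """Return 'save', or a short reason why the save is skipped."""
        if self.skip_saving:
            return "skipped: skip predicate is True"
        if not self.save_on_epoch_end:
            return "skipped: save-on-epoch-end predicate is False"
        if self.every_n_epochs < 1 or (current_epoch + 1) % self.every_n_epochs != 0:
            return "skipped: not a save epoch"
        return "save"


if __name__ == "__main__":
    # Example: the second predicate is False, so no epoch ever saves.
    gate = FakeCheckpointGate(skip_saving=False, save_on_epoch_end=False)
    for epoch in range(3):
        print(epoch, gate.explain(epoch))
```

Logging the two real predicate values at the same spot in the training script would show which one blocks the save, which is more informative than replacing the condition with `if True:`.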

I have this problem too. Changing line 66 of s1_train.py to if True: fixes it, but I don't know why the code fails to save the GPT model.

HuaQitian519 avatar Jan 28 '24 02:01 HuaQitian519

Fixed now.

RVC-Boss avatar Jan 28 '24 11:01 RVC-Boss

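This page: https://boke.shjiang.com/index.php/archives/276/ contains the following passage: "After training, the GPT_weights and SoVITS_weights folders contain no model files

First check the error output. The files may be too small to satisfy the per-GPU batch_size, which prevents saving. Try changing that parameter to 1 and train again, then check whether model files have been generated in those two folders."

I've run into this problem too, but I haven't tried this yet.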

w0z1y avatar Apr 21 '24 14:04 w0z1y
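
The quoted advice links the missing files to the per-GPU batch size. One way that could happen, assuming the data loader drops incomplete batches (the thread does not confirm this), is that a small dataset yields zero training steps per epoch, so nothing ever reaches the save point. The arithmetic, with hypothetical names throughout:

```python
def steps_per_epoch(num_samples: int, batch_size_per_gpu: int, num_gpus: int = 1,
                    drop_last: bool = True) -> int:
    """Number of batches each GPU sees per epoch, assuming an even split of the data."""
    if batch_size_per_gpu < 1 or num_gpus < 1:
        raise ValueError("batch_size_per_gpu and num_gpus must be >= 1")
    samples_per_gpu = num_samples // num_gpus
    if drop_last:
        # Incomplete final batch is discarded.
        return samples_per_gpu // batch_size_per_gpu
    # Incomplete final batch is kept: ceiling division.
    return -(-samples_per_gpu // batch_size_per_gpu)


if __name__ == "__main__":
    # A 10-sample dataset on one GPU at several batch sizes.
    for bs in (16, 8, 1):
        print(f"batch_size={bs}: {steps_per_epoch(10, bs)} steps per epoch")
```

An earlier comment in this thread reports the problem persisting with the batch size set to 1, so this would not explain every case.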