GPT-SoVITS
How exactly is V4 used with api.py or api_v2.py?
Thanks to the author for open-sourcing this great project (flowers!). How exactly is the v4 version supposed to be used through the API?
The usage notes commented at the top of api_v2.py are accurate. GPT_SoVITS/configs/tts_infer.yaml contains the last configuration used when running webui inference (from webui.py or from inference_webui_fast.py). So if you last ran webui inference with a v4 model, it should be ready to go in tts_infer.yaml.
```
WebAPI documentation
python api_v2.py -a 127.0.0.1 -p 9880 -c GPT_SoVITS/configs/tts_infer.yaml
Startup arguments:
-a - bind address, default "127.0.0.1"
-p - bind port, default 9880
-c - TTS config file path, default "GPT_SoVITS/configs/tts_infer.yaml"
Endpoints:

Inference
endpoint: /tts
GET:
http://127.0.0.1:9880/tts?text=先帝创业未半而中道崩殂,今天下三分,益州疲弊,此诚危急存亡之秋也。&text_lang=zh&ref_audio_path=archive_jingyuan_1.wav&prompt_lang=zh&prompt_text=我是「罗浮」云骑将军景元。不必拘谨,「将军」只是一时的身份,你称呼我景元便可&text_split_method=cut5&batch_size=1&media_type=wav&streaming_mode=true
POST:
{
    "text": "",                   # str. (required) text to be synthesized
    "text_lang": "",              # str. (required) language of the text to be synthesized
    "ref_audio_path": "",         # str. (required) reference audio path
    "aux_ref_audio_paths": [],    # list. (optional) auxiliary reference audio paths for multi-speaker tone fusion
    "prompt_text": "",            # str. (optional) prompt text for the reference audio
    "prompt_lang": "",            # str. (required) language of the prompt text
    "top_k": 5,                   # int. top-k sampling
    "top_p": 1,                   # float. top-p sampling
    "temperature": 1,             # float. sampling temperature
    "text_split_method": "cut0",  # str. text split method, see text_segmentation_method.py for details
    "batch_size": 1,              # int. batch size for inference
    "batch_threshold": 0.75,      # float. threshold for batch splitting
    "split_bucket": true,         # bool. whether to split the batch into multiple buckets
    "speed_factor": 1.0,          # float. speed of the synthesized audio
    "streaming_mode": false,      # bool. whether to return a streaming response
    "seed": -1,                   # int. random seed for reproducibility
    "parallel_infer": true,       # bool. whether to use parallel inference
    "repetition_penalty": 1.35,   # float. repetition penalty for the T2S model
    "sample_steps": 32,           # int. number of sampling steps for VITS model V3
    "super_sampling": false       # bool. whether to super-sample the audio when using VITS model V3
}
RESP:
success: returns the wav audio stream directly, HTTP code 200
failure: returns a JSON with the error message, HTTP code 400
Command control
endpoint: /control
command:
"restart": restart the service
"exit": shut down
GET:
http://127.0.0.1:9880/control?command=restart
POST:
{
    "command": "restart"
}
RESP: none
Switch GPT model
endpoint: /set_gpt_weights
GET:
http://127.0.0.1:9880/set_gpt_weights?weights_path=GPT_SoVITS/pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt
RESP:
success: returns "success", HTTP code 200
failure: returns a JSON with the error message, HTTP code 400
Switch SoVITS model
endpoint: /set_sovits_weights
GET:
http://127.0.0.1:9880/set_sovits_weights?weights_path=GPT_SoVITS/pretrained_models/s2G488k.pth
RESP:
success: returns "success", HTTP code 200
failure: returns a JSON with the error message, HTTP code 400
```
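For what it's worth, here is a minimal Python client sketch against the doc above. It assumes the server was started with the command shown; the reference audio path and texts are the placeholders from the GET example, not files shipped with the repo:

```python
# Minimal client sketch for the /tts endpoint documented above.
# Assumes api_v2.py is listening on 127.0.0.1:9880; the reference audio
# and texts are placeholders taken from the GET example.
import requests

payload = {
    "text": "先帝创业未半而中道崩殂,今天下三分,益州疲弊,此诚危急存亡之秋也。",
    "text_lang": "zh",
    "ref_audio_path": "archive_jingyuan_1.wav",
    "prompt_text": "我是「罗浮」云骑将军景元。不必拘谨,「将军」只是一时的身份,你称呼我景元便可",
    "prompt_lang": "zh",
    "text_split_method": "cut5",
    "batch_size": 1,
    "media_type": "wav",
    "streaming_mode": False,
}

resp = requests.post("http://127.0.0.1:9880/tts", json=payload, timeout=300)
if resp.status_code == 200:
    with open("output.wav", "wb") as f:
        f.write(resp.content)  # wav stream on success
else:
    print(resp.status_code, resp.json())  # error JSON on failure
```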
@dignome Thanks for the reply. I can see tts_infer.yaml has been updated to the v4 model, but calling api_v2.py directly synthesizes very strange audio, as if the models are mismatched. My tts_infer.yaml is as follows:

```yaml
custom:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cuda
  is_half: true
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v4
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth
v1:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt
  version: v1
  vits_weights_path: GPT_SoVITS/pretrained_models/s2G488k.pth
v2:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/gsv-v2final-pretrained/s1bert25hz-5kh-longer-epoch=12-step=369668.ckpt
  version: v2
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v2final-pretrained/s2G2333k.pth
v3:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v3
  vits_weights_path: GPT_SoVITS/pretrained_models/s2Gv3.pth
v4:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v4
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth
```
If there really is a difference, you could most likely show it by setting a fixed/static seed and matching the other parameters across api_v2.py and inference_webui_fast.py. They should produce similar results.
For best speaker reproduction, finetune a v4 model on a dataset containing at least 10 minutes of audio from that speaker using webui.py, then make sure those models are referenced in the config you pass to api_v2.py with -c <path/to/your/config.yaml>.
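If it helps, here is a hedged sketch of pointing the custom section of tts_infer.yaml at finetuned weights programmatically; the two checkpoint paths are placeholders for wherever your finetuned files landed (typically a GPT_weights*/ .ckpt and a SoVITS_weights*/ .pth produced by webui.py training):

```python
# Sketch: make the "custom" section of tts_infer.yaml reference finetuned
# v4 weights. The two checkpoint paths are placeholders for your own files.
import yaml

CONFIG = "GPT_SoVITS/configs/tts_infer.yaml"

with open(CONFIG, encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

cfg["custom"]["version"] = "v4"
cfg["custom"]["t2s_weights_path"] = "GPT_weights_v4/my_speaker-e15.ckpt"         # placeholder
cfg["custom"]["vits_weights_path"] = "SoVITS_weights_v4/my_speaker_e8_s800.pth"  # placeholder

with open(CONFIG, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, allow_unicode=True)
```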
Sounding strange is expected: although v4 shares v3's architecture, the sampling rates differ, so calling the API with the same parameters you used for v3 is bound to break. You can change the relevant parts yourself.
The handling logic I changed on my side is as follows:
```python
# --- V3 mel function definition ---
mel_fn = lambda x: mel_spectrogram_torch(x, **{
    "n_fft": 1024, "win_size": 1024, "hop_size": 256, "num_mels": 100,
    "sampling_rate": 24000, "fmin": 0, "fmax": None, "center": False,
})

# --- Added: V4 mel function definition ---
mel_fn_v4 = lambda x: mel_spectrogram_torch(x, **{
    "n_fft": 1280, "win_size": 1280, "hop_size": 320, "num_mels": 100,
    "sampling_rate": 32000,  # V4 uses a 32 kHz mel
    "fmin": 0, "fmax": None, "center": False,
})
```

```python
elif version in {"v3", "v4"}:  # v3 or v4?
    logger.info("Using V3/V4 decoding logic (vq_model.decode_encp + CFM/Vocoder)...")
    # --- V3/V4 decoding logic ---
    # 1. Pick the target sampling rate and mel function
    if model_version == "v4":
        tgt_sr = 32000
        current_mel_fn = mel_fn_v4
        logger.info(f"V4 model: using {tgt_sr} Hz and the V4 mel function.")
    else:  # V3
        tgt_sr = 24000
        current_mel_fn = mel_fn
        logger.info(f"V3 model: using {tgt_sr} Hz and the V3 mel function.")
```
The remaining problem is that pyopenjtalk is broken, so Japanese requests still end up throwing errors. Quite a headache.
Is there a good solution for this yet?
It seems that even after modifying tts_infer.yaml, running api_v2.py automatically resets version back to v2; the generated audio then presumably has the wrong sampling rate and comes out garbled.
@wangzai23333 @inktree Yes. I also tried changing the relevant parts of tts_infer.yaml and api_v2.py myself, with no good output, so I'd like to ask the maintainer (花儿大佬) to release a polished api_v2.py. :)
So is your issue resolved? api_v2.py worked for you?
> It seems that even after modifying tts_infer.yaml, running api_v2.py automatically resets version back to v2; the generated audio then presumably has the wrong sampling rate and comes out garbled.

This is the problem at line 290 of GPT_SoVITS/TTS_infer_pack/TTS.py: version = configs.get("version", "v2").lower()
In the current tts_infer.yaml, version does not live at the root level, so the get misses it and falls back to the default "v2". The simplest workaround is to hardcode v4 there and file a bug.
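To make the failure mode concrete, here is a small sketch of what that lookup sees, assuming the yaml layout quoted earlier in this thread (version only exists under the custom: key, not at the root):

```python
# Why TTS.py line 290 falls back to "v2": with the layout shown above,
# "version" only exists under the "custom" key, not at the root level.
import yaml

cfg = yaml.safe_load("""
custom:
  version: v4
  device: cuda
""")

print(cfg.get("version", "v2").lower())            # -> "v2"  (the bug)
print(cfg.get("custom", {}).get("version", "v2"))  # -> "v4"  (what was intended)
```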
Just run python api_v2.py; api_v2.py itself documents the endpoints in detail. Note that a model trained with v4 will start up under api.py, but requests then fail at audio, _ = librosa.load(filename, int(hps.data.sampling_rate)).
Run it with api_v2.py instead: edit GPT_SoVITS/configs/tts_infer.yaml to point at your custom model and it works.
As the screenshot (not reproduced here) showed, the model loads successfully; after that, follow the request documentation and send requests normally. You can test with API clients such as Apipost, Apifox, or Postman.
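For completeness, the model-switching endpoints from the doc above can be exercised the same way. A sketch; the checkpoint paths are the pretrained ones from the doc, so substitute your own finetuned files:

```python
# Sketch: switch GPT / SoVITS weights at runtime via the documented endpoints.
# The paths below are the pretrained checkpoints from the doc; use your own.
import requests

BASE = "http://127.0.0.1:9880"

r = requests.get(f"{BASE}/set_gpt_weights", params={
    "weights_path": "GPT_SoVITS/pretrained_models/s1v3.ckpt"})
print(r.status_code, r.text)  # expects "success" with HTTP 200

r = requests.get(f"{BASE}/set_sovits_weights", params={
    "weights_path": "GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth"})
print(r.status_code, r.text)
```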
> So is your issue resolved? api_v2.py worked for you?
not yet
> The simplest workaround is to hardcode v4 there and file a bug.

api_v2.py still doesn't seem to let me change the sampling_rate; the audio still comes out very strange.
> api_v2.py still doesn't seem to let me change the sampling_rate; the audio still comes out very strange.

Looking at the using_vocoder_synthesis function in GPT_SoVITS/TTS_infer_pack/TTS.py, the v4 sampling rate is already set to 32000 there, so if version is v4 the sampling rate should be correct, shouldn't it?
> Looking at the using_vocoder_synthesis function in GPT_SoVITS/TTS_infer_pack/TTS.py, the v4 sampling rate is already set to 32000 there, so if version is v4 the sampling rate should be correct, shouldn't it?

But the author says the v4 sampling rate is 48k:
"(4) v4 fixes the metallic-artifact problem that v3's non-integer-ratio upsampling could cause, and natively outputs 48k audio to avoid a muffled sound (whereas v3 natively outputs only 24k). The author considers v4 a drop-in replacement for v3, though it still needs more testing."
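One way to settle this empirically is to check the sample rate of a wav actually returned by /tts. A minimal sketch, assuming the soundfile package is installed and output.wav was saved from an API response; per the author's note quoted above, a correctly wired v4 model should report 48000, while 24000 would match v3 settings and 32000 the v4 mel-extraction rate discussed earlier:

```python
# Sketch: inspect the sample rate of audio returned by the /tts endpoint.
# Assumes the soundfile package and an "output.wav" saved from the API.
import soundfile as sf

info = sf.info("output.wav")
# 48000 -> v4 output per the author's note; 24000 -> v3 settings;
# 32000 -> would match the v4 mel-extraction rate discussed above.
print(info.samplerate, info.frames, info.format)
```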