GPT-SoVITS
How exactly is V4 used with api.py or api_v2.py?
Thanks to the author for open-sourcing this great project (flowers!). How exactly is the v4 version supposed to be used through the API?
The usage notes commented at the top of api_v2.py are accurate. GPT_SoVITS/configs/tts_infer.yaml contains the last configuration used when running webui inference (from webui.py or from inference_webui_fast.py). So if you last ran webui inference with a v4 model, it should be ready to go in tts_infer.yaml.
```
WebAPI documentation
python api_v2.py -a 127.0.0.1 -p 9880 -c GPT_SoVITS/configs/tts_infer.yaml
Startup arguments:
-a - bind address, default "127.0.0.1"
-p - bind port, default 9880
-c - TTS config file path, default "GPT_SoVITS/configs/tts_infer.yaml"
Endpoints:

Inference
endpoint: /tts
GET:
http://127.0.0.1:9880/tts?text=先帝创业未半而中道崩殂,今天下三分,益州疲弊,此诚危急存亡之秋也。&text_lang=zh&ref_audio_path=archive_jingyuan_1.wav&prompt_lang=zh&prompt_text=我是「罗浮」云骑将军景元。不必拘谨,「将军」只是一时的身份,你称呼我景元便可&text_split_method=cut5&batch_size=1&media_type=wav&streaming_mode=true
POST:
{
    "text": "",                   # str. (required) text to be synthesized
    "text_lang": "",              # str. (required) language of the text to be synthesized
    "ref_audio_path": "",         # str. (required) reference audio path
    "aux_ref_audio_paths": [],    # list. (optional) auxiliary reference audio paths for multi-speaker tone fusion
    "prompt_text": "",            # str. (optional) prompt text for the reference audio
    "prompt_lang": "",            # str. (required) language of the prompt text
    "top_k": 5,                   # int. top-k sampling
    "top_p": 1,                   # float. top-p sampling
    "temperature": 1,             # float. sampling temperature
    "text_split_method": "cut0",  # str. text split method, see text_segmentation_method.py for details
    "batch_size": 1,              # int. batch size for inference
    "batch_threshold": 0.75,      # float. threshold for batch splitting
    "split_bucket": true,         # bool. whether to split the batch into multiple buckets
    "speed_factor": 1.0,          # float. speed of the synthesized audio
    "streaming_mode": false,      # bool. whether to return a streaming response
    "seed": -1,                   # int. random seed for reproducibility
    "parallel_infer": true,       # bool. whether to use parallel inference
    "repetition_penalty": 1.35,   # float. repetition penalty for the T2S model
    "sample_steps": 32,           # int. number of sampling steps for VITS model V3
    "super_sampling": false       # bool. whether to super-sample the audio when using VITS model V3
}
RESP:
success: returns the wav audio stream directly, HTTP code 200
failure: returns a JSON with the error message, HTTP code 400
Command control
endpoint: /control
command:
"restart": restart the service
"exit": shut down
GET:
http://127.0.0.1:9880/control?command=restart
POST:
{
    "command": "restart"
}
RESP: none
Switch GPT model
endpoint: /set_gpt_weights
GET:
http://127.0.0.1:9880/set_gpt_weights?weights_path=GPT_SoVITS/pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt
RESP:
success: returns "success", HTTP code 200
failure: returns a JSON with the error message, HTTP code 400
Switch SoVITS model
endpoint: /set_sovits_weights
GET:
http://127.0.0.1:9880/set_sovits_weights?weights_path=GPT_SoVITS/pretrained_models/s2G488k.pth
RESP:
success: returns "success", HTTP code 200
failure: returns a JSON with the error message, HTTP code 400
```
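For what it's worth, here is a minimal Python client sketch against the doc above. It assumes the server was started with the command shown; the reference audio path and texts are the placeholders from the GET example, not files shipped with the repo:

```python
# Minimal client sketch for the /tts endpoint documented above.
# Assumes api_v2.py is listening on 127.0.0.1:9880; the reference audio
# and texts are placeholders taken from the GET example.
import requests

payload = {
    "text": "先帝创业未半而中道崩殂,今天下三分,益州疲弊,此诚危急存亡之秋也。",
    "text_lang": "zh",
    "ref_audio_path": "archive_jingyuan_1.wav",
    "prompt_text": "我是「罗浮」云骑将军景元。不必拘谨,「将军」只是一时的身份,你称呼我景元便可",
    "prompt_lang": "zh",
    "text_split_method": "cut5",
    "batch_size": 1,
    "media_type": "wav",
    "streaming_mode": False,
}

resp = requests.post("http://127.0.0.1:9880/tts", json=payload, timeout=300)
if resp.status_code == 200:
    with open("output.wav", "wb") as f:
        f.write(resp.content)  # wav stream on success
else:
    print(resp.status_code, resp.json())  # error JSON on failure
```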
@dignome Thanks for the reply. I can see tts_infer.yaml has been updated to the v4 model, but calling api_v2.py directly synthesizes very strange audio, as if the models are mismatched. My tts_infer.yaml is as follows:

```yaml
custom:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cuda
  is_half: true
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v4
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth
v1:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt
  version: v1
  vits_weights_path: GPT_SoVITS/pretrained_models/s2G488k.pth
v2:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/gsv-v2final-pretrained/s1bert25hz-5kh-longer-epoch=12-step=369668.ckpt
  version: v2
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v2final-pretrained/s2G2333k.pth
v3:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v3
  vits_weights_path: GPT_SoVITS/pretrained_models/s2Gv3.pth
v4:
  bert_base_path: GPT_SoVITS/pretrained_models/chinese-roberta-wwm-ext-large
  cnhuhbert_base_path: GPT_SoVITS/pretrained_models/chinese-hubert-base
  device: cpu
  is_half: false
  t2s_weights_path: GPT_SoVITS/pretrained_models/s1v3.ckpt
  version: v4
  vits_weights_path: GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth
```
If there really is a difference, you could most likely show it by setting a fixed/static seed and matching the other parameters across api_v2.py and inference_webui_fast.py. They should produce similar results.
For best speaker reproduction, finetune a v4 model on a dataset containing at least 10 minutes of audio from that speaker using webui.py, then make sure those models are referenced in the config you pass to api_v2.py with -c <path/to/your/config.yaml>.
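If it helps, here is a hedged sketch of pointing the custom section of tts_infer.yaml at finetuned weights programmatically; the two checkpoint paths are placeholders for wherever your finetuned files landed (typically a GPT_weights*/ .ckpt and a SoVITS_weights*/ .pth produced by webui.py training):

```python
# Sketch: make the "custom" section of tts_infer.yaml reference finetuned
# v4 weights. The two checkpoint paths are placeholders for your own files.
import yaml

CONFIG = "GPT_SoVITS/configs/tts_infer.yaml"

with open(CONFIG, encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

cfg["custom"]["version"] = "v4"
cfg["custom"]["t2s_weights_path"] = "GPT_weights_v4/my_speaker-e15.ckpt"         # placeholder
cfg["custom"]["vits_weights_path"] = "SoVITS_weights_v4/my_speaker_e8_s800.pth"  # placeholder

with open(CONFIG, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, allow_unicode=True)
```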
Sounding strange is expected: although v4 shares v3's architecture, the sampling rates differ, so calling the API with the same parameters you used for v3 is bound to break. You can change the relevant parts yourself.
The handling logic I changed on my side is as follows:
```python
# --- V3 mel function definition ---
mel_fn = lambda x: mel_spectrogram_torch(x, **{
    "n_fft": 1024, "win_size": 1024, "hop_size": 256, "num_mels": 100,
    "sampling_rate": 24000, "fmin": 0, "fmax": None, "center": False,
})

# --- Added: V4 mel function definition ---
mel_fn_v4 = lambda x: mel_spectrogram_torch(x, **{
    "n_fft": 1280, "win_size": 1280, "hop_size": 320, "num_mels": 100,
    "sampling_rate": 32000,  # V4 uses a 32 kHz mel
    "fmin": 0, "fmax": None, "center": False,
})
```

```python
elif version in {"v3", "v4"}:  # v3 or v4?
    logger.info("Using V3/V4 decoding logic (vq_model.decode_encp + CFM/Vocoder)...")
    # --- V3/V4 decoding logic ---
    # 1. Pick the target sampling rate and mel function
    if model_version == "v4":
        tgt_sr = 32000
        current_mel_fn = mel_fn_v4
        logger.info(f"V4 model: using {tgt_sr} Hz and the V4 mel function.")
    else:  # V3
        tgt_sr = 24000
        current_mel_fn = mel_fn
        logger.info(f"V3 model: using {tgt_sr} Hz and the V3 mel function.")
```
The remaining problem is that pyopenjtalk is broken, so Japanese requests still end up throwing errors. Quite a headache.
Is there a good solution for this yet?
It seems that even after modifying tts_infer.yaml, running api_v2.py automatically resets version back to v2; the generated audio then presumably has the wrong sampling rate and comes out garbled.
@wangzai23333 @inktree Yes. I also tried changing the relevant parts of tts_infer.yaml and api_v2.py myself, with no good output, so I'd like to ask the maintainer (花儿大佬) to release a polished api_v2.py. :)
So is your issue resolved? api_v2.py worked for you?
> It seems that even after modifying tts_infer.yaml, running api_v2.py automatically resets version back to v2; the generated audio then presumably has the wrong sampling rate and comes out garbled.

This is the problem at line 290 of GPT_SoVITS/TTS_infer_pack/TTS.py: version = configs.get("version", "v2").lower()
In the current tts_infer.yaml, version does not live at the root level, so the get misses it and falls back to the default "v2". The simplest workaround is to hardcode v4 there and file a bug.
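To make the failure mode concrete, here is a small sketch of what that lookup sees, assuming the yaml layout quoted earlier in this thread (version only exists under the custom: key, not at the root):

```python
# Why TTS.py line 290 falls back to "v2": with the layout shown above,
# "version" only exists under the "custom" key, not at the root level.
import yaml

cfg = yaml.safe_load("""
custom:
  version: v4
  device: cuda
""")

print(cfg.get("version", "v2").lower())            # -> "v2"  (the bug)
print(cfg.get("custom", {}).get("version", "v2"))  # -> "v4"  (what was intended)
```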
Just run python api_v2.py; api_v2.py itself documents the endpoints in detail. Note that a model trained with v4 will start up under api.py, but requests then fail at audio, _ = librosa.load(filename, int(hps.data.sampling_rate)).
Run it with api_v2.py instead: edit GPT_SoVITS/configs/tts_infer.yaml to point at your custom model and it works.
As the screenshot (not reproduced here) showed, the model loads successfully; after that, follow the request documentation and send requests normally. You can test with API clients such as Apipost, Apifox, or Postman.
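For completeness, the model-switching endpoints from the doc above can be exercised the same way. A sketch; the checkpoint paths are the pretrained ones from the doc, so substitute your own finetuned files:

```python
# Sketch: switch GPT / SoVITS weights at runtime via the documented endpoints.
# The paths below are the pretrained checkpoints from the doc; use your own.
import requests

BASE = "http://127.0.0.1:9880"

r = requests.get(f"{BASE}/set_gpt_weights", params={
    "weights_path": "GPT_SoVITS/pretrained_models/s1v3.ckpt"})
print(r.status_code, r.text)  # expects "success" with HTTP 200

r = requests.get(f"{BASE}/set_sovits_weights", params={
    "weights_path": "GPT_SoVITS/pretrained_models/gsv-v4-pretrained/s2Gv4.pth"})
print(r.status_code, r.text)
```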
> So is your issue resolved? api_v2.py worked for you?
not yet
> The simplest workaround is to hardcode v4 there and file a bug.

api_v2.py still doesn't seem to let me change the sampling_rate; the audio still comes out very strange.
> api_v2.py still doesn't seem to let me change the sampling_rate; the audio still comes out very strange.

Looking at the using_vocoder_synthesis function in GPT_SoVITS/TTS_infer_pack/TTS.py, the v4 sampling rate is already set to 32000 there, so if version is v4 the sampling rate should be correct, shouldn't it?
> Looking at the using_vocoder_synthesis function in GPT_SoVITS/TTS_infer_pack/TTS.py, the v4 sampling rate is already set to 32000 there, so if version is v4 the sampling rate should be correct, shouldn't it?

But the author says the v4 sampling rate is 48k:
"(4) v4 fixes the metallic-artifact problem that v3's non-integer-ratio upsampling could cause, and natively outputs 48k audio to avoid a muffled sound (whereas v3 natively outputs only 24k). The author considers v4 a drop-in replacement for v3, though it still needs more testing."
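One way to settle this empirically is to check the sample rate of a wav actually returned by /tts. A minimal sketch, assuming the soundfile package is installed and output.wav was saved from an API response; per the author's note quoted above, a correctly wired v4 model should report 48000, while 24000 would match v3 settings and 32000 the v4 mel-extraction rate discussed earlier:

```python
# Sketch: inspect the sample rate of audio returned by the /tts endpoint.
# Assumes the soundfile package and an "output.wav" saved from the API.
import soundfile as sf

info = sf.info("output.wav")
# 48000 -> v4 output per the author's note; 24000 -> v3 settings;
# 32000 -> would match the v4 mel-extraction rate discussed above.
print(info.samplerate, info.frames, info.format)
```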