GPT-SoVITS

About api_v2

Open phoenixdna opened this issue 4 months ago • 3 comments

After successfully deploying api_v2, I found that the audio I receive from the server is frequently a blank WAV file. I don't know what's wrong. Has anyone else encountered the same problem?

My code is as follows:

    import os
    from pathlib import Path

    import requests


    def generate_tts_post(text, text_lang, ref_audio_path, prompt_text, prompt_lang,
                          output_path="output/out.wav",
                          server_url="http://127.0.0.1:9880/tts"):
        payload = {
            "text": text,
            "text_lang": text_lang,
            "ref_audio_path": str(ref_audio_path),
            "prompt_text": prompt_text,
            "prompt_lang": prompt_lang,
            "text_split_method": "cut0",
            "batch_size": 1,
            "media_type": "wav",
            "fragment_interval": 0,
            "streaming_mode": False,
        }

        try:
            print(f"[→] Sending POST request to {server_url}")
            response = requests.post(server_url, json=payload, stream=False)

            if response.status_code != 200:
                print(f"[✗] Request failed, status code: {response.status_code}")
                print(response.text)
                return None

            # Reject anything that is not an audio response (e.g. a JSON error body).
            content_type = response.headers.get("Content-Type", "")
            if not content_type.startswith("audio"):
                print(f"[✗] Response is not audio, got: {content_type}")
                print(response.content.decode("utf-8", errors="ignore"))
                return None

            os.makedirs(Path(output_path).parent, exist_ok=True)
            with open(output_path, "wb") as f:
                f.write(response.content)

            # is_audio_effective() is a helper that is not shown in this snippet.
            if is_audio_effective(output_path):
                print(f"[✓] Audio saved: {output_path}")
                return output_path
            else:
                print(f"[✗] Audio is invalid, deleting: {output_path}")
                os.remove(output_path)
                return None

        except Exception as e:
            print(f"[✗] Request error: {e}")
            return None
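The snippet above calls `is_audio_effective`, which is not defined in the post. As a rough sketch of what such a check could look like (an assumption, not the poster's actual helper), the function below treats a WAV file as valid only if it is long enough and not near-silent. The thresholds `min_seconds` and `min_rms` are illustrative values, and only 16-bit PCM is handled.

```python
import math
import struct
import wave


def is_audio_effective(path, min_seconds=0.5, min_rms=100.0):
    """Return True if the WAV at `path` is long enough and not near-silent.

    Hypothetical helper: thresholds are illustrative, 16-bit PCM only.
    """
    try:
        with wave.open(str(path), "rb") as wav:
            if wav.getsampwidth() != 2:
                return False  # this sketch only understands 16-bit samples
            n_frames = wav.getnframes()
            rate = wav.getframerate()
            raw = wav.readframes(n_frames)
    except (wave.Error, EOFError, OSError):
        return False  # missing file or not a readable WAV

    if rate <= 0 or n_frames / rate < min_seconds:
        return False  # too short to contain real speech

    # Unpack little-endian signed 16-bit samples (channels interleaved).
    count = len(raw) // 2
    if count == 0:
        return False
    samples = struct.unpack(f"<{count}h", raw[: count * 2])
    rms = math.sqrt(sum(s * s for s in samples) / count)
    return rms >= min_rms
```

A duration check alone would already catch a file that holds only a fraction of a second of audio; the RMS check additionally catches files that have the right length but contain silence.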

In tts_infer.yaml, I have tried every combination of the models, but the result with api_v2 is still the same. There is a 30-50% chance that it generates a blank WAV stream, and about 10% of the time it returns a wrong combination of different voice pieces. (inference_cli gives much better results, but it is far too slow.)

I really have no idea why this happens. Is the configuration wrong, or are the parameters wrong? Could somebody please help?

phoenixdna • Aug 08 '25 04:08

Can you give me a detailed log, or a sample that reproduces the problem? In addition, text_split_method should be set to another value, such as "cut5".
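A minimal sketch of that change, assuming the request payload is a plain dict like the one in the original post. The helper name `with_split_method` is hypothetical, and the set of method names (`cut0` to `cut5`) is an assumption about what the server accepts, so check it against your own installation.

```python
# Split method names assumed to be accepted by the server.
KNOWN_SPLIT_METHODS = {"cut0", "cut1", "cut2", "cut3", "cut4", "cut5"}


def with_split_method(payload, method="cut5"):
    """Return a copy of `payload` with text_split_method replaced."""
    if method not in KNOWN_SPLIT_METHODS:
        raise ValueError(f"unknown text_split_method: {method!r}")
    updated = dict(payload)  # shallow copy so the caller's dict is untouched
    updated["text_split_method"] = method
    return updated


payload = {"text": "hello", "text_split_method": "cut0", "batch_size": 1}
print(with_split_method(payload)["text_split_method"])
```

Returning a copy keeps the original payload available, which makes it easy to send the same text with both settings and compare the two outputs.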

ChasonJiang • Aug 08 '25 08:08

Please see the following log. After this run I got a WAV file smaller than 50 KB.

    Set seed to 769865282
    Parallel inference mode is enabled
    When parallel inference mode is enabled, SoVits V3/4 models do not support bucketing; bucketing has been disabled automatically
    Fragment interval is too small; automatically set to 0.01
    Actual input reference text: 海 过 海 关 不 要 紧 张 海 过 海 关 不 要 紧 张。
    ############ Split text ############
    Actual input target text: 买体检套餐,基础项目+肿瘤筛查,价格1280元,用医保支付
    Actual input target text (after sentence splitting): ['买体检套餐,', '基础项目+肿瘤筛查,', '价格1280元,', '用医保支付。']
    ############ Extract text BERT features ############
    100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 81.63it/s]
    ############ Inference ############
    Text after front-end processing (per sentence): ['买体检套餐,']
    ############ Predict semantic tokens ############
    0%| | 0/1500 [00:00<?, ?it/s]T2S Decoding EOS [136 -> 138]
    0%| | 1/1500 [00:00<00:28, 52.63it/s]
    ############ Synthesize audio ############
    Parallel synthesis in progress...
    Text after front-end processing (per sentence): ['基础项目加肿瘤筛查,']
    ############ Predict semantic tokens ############
    0%| | 0/1500 [00:00<?, ?it/s]T2S Decoding EOS [136 -> 138]
    0%| | 1/1500 [00:00<00:27, 53.91it/s]
    ############ Synthesize audio ############
    Parallel synthesis in progress...
    Text after front-end processing (per sentence): ['价格一千二百八十元,']
    ############ Predict semantic tokens ############
    0%| | 0/1500 [00:00<?, ?it/s]T2S Decoding EOS [136 -> 138]
    0%| | 1/1500 [00:00<00:20, 71.43it/s]
    ############ Synthesize audio ############
    Parallel synthesis in progress...
    Text after front-end processing (per sentence): ['用医保支付.']
    ############ Predict semantic tokens ############
    0%| | 0/1500 [00:00<?, ?it/s]T2S Decoding EOS [136 -> 139]
    0%| | 2/1500 [00:00<00:15, 95.25it/s]
    ############ Synthesize audio ############
    Parallel synthesis in progress...
    0.016 0.052 0.423 5.226
    INFO: 127.0.0.1:63761 - "POST /tts HTTP/1.1" 200 OK
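In this log, every sentence ends with `T2S Decoding EOS [136 -> 138]` or `[136 -> 139]`, and the progress bar stops at 1 or 2 of 1500 steps. If the two numbers are the token sequence length before and after decoding (an assumption about the log format, not something confirmed in this thread), each sentence produced only two or three tokens, which would be consistent with a near-empty WAV file. The sketch below extracts those counts from captured server output; the function names and the `min_tokens` threshold are hypothetical.

```python
import re

# Matches lines such as: T2S Decoding EOS [136 -> 138]
EOS_PATTERN = re.compile(r"T2S Decoding EOS \[(\d+) -> (\d+)\]")


def generated_token_counts(log_text):
    """Return, per decoded sentence, how many tokens were added before EOS."""
    return [int(end) - int(start) for start, end in EOS_PATTERN.findall(log_text)]


def suspiciously_short(log_text, min_tokens=10):
    """Return True if any sentence stopped after fewer than `min_tokens` tokens."""
    return any(n < min_tokens for n in generated_token_counts(log_text))


log_excerpt = """
T2S Decoding EOS [136 -> 138]
T2S Decoding EOS [136 -> 138]
T2S Decoding EOS [136 -> 138]
T2S Decoding EOS [136 -> 139]
"""
print(generated_token_counts(log_excerpt))
print(suspiciously_short(log_excerpt))
```

Comparing these counts between a request that produced good audio and one that produced a blank file would show whether early EOS is what separates the two cases.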

phoenixdna • Aug 13 '25 08:08

So you are saying that the audio files you get are corrupted. I used to have this issue, where a corrupted file was generated, and I suspect it might be related to streaming mode. Since "streaming_mode" in your payload is already turned off, there is nothing more that can be done there. Maybe you should try toggling it on and off? I solved my issue, and here is my payload.

    payload = {
        "text": text,                  
        "text_lang": lang,               
        "ref_audio_path": audiopath,         
        "aux_ref_audio_paths": [],    
        "prompt_text": prompt,            
        "prompt_lang": "ja",            
        "top_k": 15,                   
        "top_p": 0.95,                   
        "temperature": 0.7,
        "text_split_method": "cut0",  
        "batch_size": 1,             
        "batch_threshold": 0.75,      
        "split_bucket": True,         
        "speed_factor":1.0,           
        "streaming_mode": False,     
        "seed": -1,                  
        "parallel_infer": True,       
        "repetition_penalty": 1.5,   
        "sample_steps": 32,           
        "super_sampling": False       
    }
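To see which settings actually differ between this payload and the one in the original post, a small comparison can help. The sketch below is illustrative only: `payload_diff` is a hypothetical helper, and the two dicts contain just the literal settings from each snippet, leaving out the text, language, and reference-audio fields that are variables in both.

```python
def payload_diff(mine, theirs):
    """Map each differing key to a (mine, theirs) pair; '<missing>' marks an absent key."""
    missing = "<missing>"
    keys = sorted(set(mine) | set(theirs))
    return {
        k: (mine.get(k, missing), theirs.get(k, missing))
        for k in keys
        if mine.get(k, missing) != theirs.get(k, missing)
    }


# Literal settings from the original post.
original = {
    "text_split_method": "cut0",
    "batch_size": 1,
    "media_type": "wav",
    "fragment_interval": 0,
    "streaming_mode": False,
}
# Literal settings from the payload above.
working = {
    "aux_ref_audio_paths": [],
    "top_k": 15,
    "top_p": 0.95,
    "temperature": 0.7,
    "text_split_method": "cut0",
    "batch_size": 1,
    "batch_threshold": 0.75,
    "split_bucket": True,
    "speed_factor": 1.0,
    "streaming_mode": False,
    "seed": -1,
    "parallel_infer": True,
    "repetition_penalty": 1.5,
    "sample_steps": 32,
    "super_sampling": False,
}

for key, (a, b) in payload_diff(original, working).items():
    print(f"{key}: original={a!r} working={b!r}")
```

Both payloads share `text_split_method`, `batch_size`, and `streaming_mode`, so the differences are the sampling and batching parameters that this payload sets explicitly. Whether any of them matters depends on the server's defaults for omitted keys, which this thread does not establish.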

Another possible method is to convert the .wav to .mp3 automatically with the AudioSegment class from pydub before returning the audio file from the function.

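    from pydub import AudioSegment  # MP3 export needs ffmpeg installed

    # Fragment from inside the request function; `response` is the
    # requests.Response returned by the POST to /tts.
    if response.status_code == 200:
        with open("output.wav", "wb") as f:
            f.write(response.content)  # the with block closes the file
        print("Audio generated and saved to output.wav")
        print("Converting to .mp3.")
        audio = AudioSegment.from_wav("output.wav")
        audio.export("output.mp3", format="mp3")
        return "output.mp3", "Action completed."
    else:
        print(f"Error: {response.status_code} - {response.text}")
        # Same (path, message) shape as the success case.
        return None, "Unexpected Error"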

lpkpaco • Sep 24 '25 00:09