GPT-SoVITS
Can this model do low-latency text-to-speech, something like converting streaming output to speech in real time? Has anyone dug into this? Thanks!
Same question here.
That's literally what this project is for, duh.
Speechless at the reply above. What they're asking is whether it can do real-time speech output, like Kokoro does. Unfortunately Kokoro can't be trained on your own data, and its Chinese support is mediocre.
I used it a while back and the speed was genuinely decent; you could even run multiple instances on CPU. Not sure how it is now, but given how it works, real-time should be achievable.
Pseudo-streaming works: the first inference is slow to start, but after that generation is usually much faster than playback, so you can keep feeding it text.
But for something like a voice assistant it gets tricky, since the latency requirement on the first response is quite strict. I'm also curious whether splitting the text too finely causes discontinuities. A minimal sketch of the pseudo-streaming pattern follows.
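This sketch assumes a hypothetical `synthesize(chunk)` function that wraps one GPT-SoVITS inference call and returns a numpy array of samples; `sounddevice` handles playback. It is a sketch of the play-while-generating idea, not tested code:

```python
import queue
import threading

import sounddevice as sd  # pip install sounddevice

SAMPLE_RATE = 32000  # GPT-SoVITS v2 models output 32 kHz audio


def stream_playback(text_chunks, synthesize):
    """Generate audio chunk by chunk while a player thread stays ahead of generation."""
    audio_queue: queue.Queue = queue.Queue()

    def player():
        while True:
            fragment = audio_queue.get()
            if fragment is None:  # sentinel: generation finished
                break
            sd.play(fragment, SAMPLE_RATE)
            sd.wait()  # block until this fragment finishes playing

    thread = threading.Thread(target=player, daemon=True)
    thread.start()

    # Generation is usually faster than playback, so after the first
    # (slow) inference the queue never runs dry.
    for chunk in text_chunks:
        audio_queue.put(synthesize(chunk))  # synthesize: one TTS call -> np.ndarray
    audio_queue.put(None)
    thread.join()
```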
> Can this model do low-latency text-to-speech, something like converting streaming output to speech in real time? Has anyone dug into this? Thanks!
Split the bot's reply text into chunks instead of sending one huge string to the server, then concatenate the audio on the frontend. I built a chat page that uses GPT-SoVITS with a clone of my own voice for the replies, and the real-time feel is decent. Code for reference: PsychoSolace. A sketch of the chunking approach is below.
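The sketch assumes the bundled api_v2.py server is running on its default address; the endpoint path and parameter names below should be checked against your server version:

```python
import re

import requests

API_URL = "http://127.0.0.1:9880/tts"  # default api_v2.py address; adjust as needed


def split_sentences(text: str) -> list[str]:
    # split the LLM reply on sentence-final punctuation (CJK and ASCII)
    parts = re.split(r"(?<=[。!?!?;;])", text)
    return [p.strip() for p in parts if p.strip()]


def tts_chunks(reply: str):
    # one short request per sentence instead of one huge request
    for sentence in split_sentences(reply):
        resp = requests.post(API_URL, json={
            "text": sentence,
            "text_lang": "zh",
            "ref_audio_path": "ref.wav",  # reference clip for the cloned voice
            "prompt_text": "transcript of the reference clip",
            "prompt_lang": "zh",
        })
        resp.raise_for_status()
        yield resp.content  # WAV bytes, ready to queue for playback on the frontend
```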
To make it truly real-time you need to modify this file: GPT_SoVITS\AR\models\t2s_model.py
```python
# progress bar
for idx in tqdm(range(1500)):
    if xy_attn_mask is not None:
        xy_dec, k_cache, v_cache = self.t2s_transformer.process_prompt(xy_pos, xy_attn_mask, None)
    else:
        xy_dec, k_cache, v_cache = self.t2s_transformer.decode_next_token(xy_pos, k_cache, v_cache)

    logits = self.ar_predict_layer(xy_dec[:, -1])

    if idx == 0:
        xy_attn_mask = None
    if idx < 11:  # force at least 10 predicted tokens before EOS is allowed (~0.4 s)
        logits = logits[:, :-1]

    samples = sample(
        logits, y, top_k=top_k, top_p=top_p,
        repetition_penalty=repetition_penalty, temperature=temperature,
    )[0]
    y = torch.concat([y, samples], dim=1)

    if early_stop_num != -1 and (y.shape[1] - prefix_len) > early_stop_num:
        print("use early stop num:", early_stop_num)
        stop = True
    if torch.argmax(logits, dim=-1)[0] == self.EOS or samples[0, 0] == self.EOS:
        stop = True

    if stop:
        if y.shape[1] == 0:
            y = torch.concat([y, torch.zeros_like(samples)], dim=1)
            print("bad zero prediction")
        # print(f"T2S Decoding EOS [{prefix_len} -> {y.shape[1]}]")
        break
    else:
        print("----------------------- real-time output -----------", samples, idx)
        # This is where the audio for the new token would be generated. I haven't
        # managed to wire it up yet; I'm currently trying to play sound from here.
        # phones = batch_phones[i].unsqueeze(0).to(self.configs.device)
        # _pred_semantic = samples.unsqueeze(0).unsqueeze(0)
        # audio_fragment = (self.vits_model.decode(
        #     _pred_semantic, phones, refer_audio_spec, speed=speed_factor
        # ).detach()[0, 0, :])

    ####################### update next step ###################################
    y_emb = self.ar_audio_embedding(y[:, -1:])
    xy_pos = y_emb * self.ar_audio_position.x_scale + self.ar_audio_position.alpha * \
        self.ar_audio_position.pe[:, y_len + idx].to(dtype=y_emb.dtype, device=y_emb.device)

if ref_free:
    return y[:, :-1], 0
return y[:, :-1], idx - 1
```
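One possible shape for the missing piece at the "real-time output" point: buffer the sampled semantic tokens and decode them in windows, reusing the names the commented-out lines reference (`vits_model.decode`, `phones`, `refer_audio_spec`). `stream_decode` and `TOKENS_PER_FRAGMENT` are hypothetical, and this is an untested sketch:

```python
import sounddevice as sd  # assumption: local playback; swap for your audio sink
import torch

TOKENS_PER_FRAGMENT = 25  # ~0.5 s of audio per decode; trade latency vs. continuity


def stream_decode(token_iter, vits_model, phones, refer_audio_spec, speed_factor=1.0):
    # token_iter yields the `samples` tensor from each loop iteration above
    buffer = []
    for samples in token_iter:
        buffer.append(samples)
        if len(buffer) >= TOKENS_PER_FRAGMENT:
            pred_semantic = torch.cat(buffer, dim=1).unsqueeze(0)  # [1, 1, T]
            fragment = vits_model.decode(
                pred_semantic, phones, refer_audio_spec, speed=speed_factor
            ).detach()[0, 0, :]
            sd.play(fragment.cpu().numpy(), 32000)  # 32 kHz output
            sd.wait()
            buffer = []
```

Note that decoding disjoint windows independently can produce audible seams at fragment boundaries (the discontinuity concern raised earlier in the thread); decoding a growing prefix and playing only the new tail, or overlapping the windows, are common workarounds.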
After these changes you can get the first audio chunk in 300-500 ms, which is plenty for most uses.
With acceleration, around 150 ms on a 5090.
For real?? Did you export to TorchScript and run inference with libtorch? Does the AR decode really fly like that? 😚 Just generating the audio takes me nearly 150 ms.
It's on my profile page.
hey guys! I know this issue is about something else, but has anyone trained the model for other languages? Is it possible to train it for Spanish, French, etc.? I'd really appreciate a response. Regards
GitHub Wikis
hey @XXXXRT666, sorry, I don't understand. What do you mean by GitHub Wikis? Is there anything out there on training this model, or have other people already done it? Thank you!
GitHub wiki of this repo
> For real?? Did you export to TorchScript and run inference with libtorch? Does the AR decode really fly like that? 😚 Just generating the audio takes me nearly 150 ms.
https://github.com/XXXXRT666/GPT-SoVITS/tree/Accel#infer-speed