GPT-SoVITS
Can this model do low-latency text-to-speech, something like converting streaming output to speech in real time? Has anyone dug into this? Thanks!
Same question here.
That's literally what this project is for, duh.
Speechless at the reply above. What they're asking is whether it can do real-time speech output, like Kokoro does. Unfortunately Kokoro can't be trained on your own data, and its Chinese support is mediocre.
I used it a while back and the speed was genuinely decent; you could even run multiple instances on CPU. Not sure how it is now, but given how it works, real-time should be achievable.
Pseudo-streaming works: the first inference is slow to start, but after that generation is usually much faster than playback, so you can keep feeding it text.
But for something like a voice assistant it gets tricky, since the latency requirement on the first response is quite strict. I'm also curious whether splitting the text too finely causes discontinuities. A minimal sketch of the pseudo-streaming pattern follows.
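This sketch assumes a hypothetical `synthesize(chunk)` function that wraps one GPT-SoVITS inference call and returns a numpy array of samples; `sounddevice` handles playback. It is a sketch of the play-while-generating idea, not tested code:

```python
import queue
import threading

import sounddevice as sd  # pip install sounddevice

SAMPLE_RATE = 32000  # GPT-SoVITS v2 models output 32 kHz audio


def stream_playback(text_chunks, synthesize):
    """Generate audio chunk by chunk while a player thread stays ahead of generation."""
    audio_queue: queue.Queue = queue.Queue()

    def player():
        while True:
            fragment = audio_queue.get()
            if fragment is None:  # sentinel: generation finished
                break
            sd.play(fragment, SAMPLE_RATE)
            sd.wait()  # block until this fragment finishes playing

    thread = threading.Thread(target=player, daemon=True)
    thread.start()

    # Generation is usually faster than playback, so after the first
    # (slow) inference the queue never runs dry.
    for chunk in text_chunks:
        audio_queue.put(synthesize(chunk))  # synthesize: one TTS call -> np.ndarray
    audio_queue.put(None)
    thread.join()
```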
> Can this model do low-latency text-to-speech, something like converting streaming output to speech in real time? Has anyone dug into this? Thanks!
Split the bot's reply text into chunks instead of sending one huge string to the server, then concatenate the audio on the frontend. I built a chat page that uses GPT-SoVITS with a clone of my own voice for the replies, and the real-time feel is decent. Code for reference: PsychoSolace. A sketch of the chunking approach is below.
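The sketch assumes the bundled api_v2.py server is running on its default address; the endpoint path and parameter names below should be checked against your server version:

```python
import re

import requests

API_URL = "http://127.0.0.1:9880/tts"  # default api_v2.py address; adjust as needed


def split_sentences(text: str) -> list[str]:
    # split the LLM reply on sentence-final punctuation (CJK and ASCII)
    parts = re.split(r"(?<=[。!?!?;;])", text)
    return [p.strip() for p in parts if p.strip()]


def tts_chunks(reply: str):
    # one short request per sentence instead of one huge request
    for sentence in split_sentences(reply):
        resp = requests.post(API_URL, json={
            "text": sentence,
            "text_lang": "zh",
            "ref_audio_path": "ref.wav",  # reference clip for the cloned voice
            "prompt_text": "transcript of the reference clip",
            "prompt_lang": "zh",
        })
        resp.raise_for_status()
        yield resp.content  # WAV bytes, ready to queue for playback on the frontend
```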
To make it truly real-time you need to modify this file: GPT_SoVITS\AR\models\t2s_model.py
```python
# progress bar
for idx in tqdm(range(1500)):
    if xy_attn_mask is not None:
        xy_dec, k_cache, v_cache = self.t2s_transformer.process_prompt(xy_pos, xy_attn_mask, None)
    else:
        xy_dec, k_cache, v_cache = self.t2s_transformer.decode_next_token(xy_pos, k_cache, v_cache)

    logits = self.ar_predict_layer(xy_dec[:, -1])

    if idx == 0:
        xy_attn_mask = None
    if idx < 11:  # force at least 10 predicted tokens before EOS is allowed (~0.4 s)
        logits = logits[:, :-1]

    samples = sample(
        logits, y, top_k=top_k, top_p=top_p,
        repetition_penalty=repetition_penalty, temperature=temperature,
    )[0]
    y = torch.concat([y, samples], dim=1)

    if early_stop_num != -1 and (y.shape[1] - prefix_len) > early_stop_num:
        print("use early stop num:", early_stop_num)
        stop = True
    if torch.argmax(logits, dim=-1)[0] == self.EOS or samples[0, 0] == self.EOS:
        stop = True

    if stop:
        if y.shape[1] == 0:
            y = torch.concat([y, torch.zeros_like(samples)], dim=1)
            print("bad zero prediction")
        # print(f"T2S Decoding EOS [{prefix_len} -> {y.shape[1]}]")
        break
    else:
        print("----------------------- real-time output -----------", samples, idx)
        # This is where the audio for the new token would be generated. I haven't
        # managed to wire it up yet; I'm currently trying to play sound from here.
        # phones = batch_phones[i].unsqueeze(0).to(self.configs.device)
        # _pred_semantic = samples.unsqueeze(0).unsqueeze(0)
        # audio_fragment = (self.vits_model.decode(
        #     _pred_semantic, phones, refer_audio_spec, speed=speed_factor
        # ).detach()[0, 0, :])

    ####################### update next step ###################################
    y_emb = self.ar_audio_embedding(y[:, -1:])
    xy_pos = y_emb * self.ar_audio_position.x_scale + self.ar_audio_position.alpha * \
        self.ar_audio_position.pe[:, y_len + idx].to(dtype=y_emb.dtype, device=y_emb.device)

if ref_free:
    return y[:, :-1], 0
return y[:, :-1], idx - 1
```
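One possible shape for the missing piece at the "real-time output" point: buffer the sampled semantic tokens and decode them in windows, reusing the names the commented-out lines reference (`vits_model.decode`, `phones`, `refer_audio_spec`). `stream_decode` and `TOKENS_PER_FRAGMENT` are hypothetical, and this is an untested sketch:

```python
import sounddevice as sd  # assumption: local playback; swap for your audio sink
import torch

TOKENS_PER_FRAGMENT = 25  # ~0.5 s of audio per decode; trade latency vs. continuity


def stream_decode(token_iter, vits_model, phones, refer_audio_spec, speed_factor=1.0):
    # token_iter yields the `samples` tensor from each loop iteration above
    buffer = []
    for samples in token_iter:
        buffer.append(samples)
        if len(buffer) >= TOKENS_PER_FRAGMENT:
            pred_semantic = torch.cat(buffer, dim=1).unsqueeze(0)  # [1, 1, T]
            fragment = vits_model.decode(
                pred_semantic, phones, refer_audio_spec, speed=speed_factor
            ).detach()[0, 0, :]
            sd.play(fragment.cpu().numpy(), 32000)  # 32 kHz output
            sd.wait()
            buffer = []
```

Note that decoding disjoint windows independently can produce audible seams at fragment boundaries (the discontinuity concern raised earlier in the thread); decoding a growing prefix and playing only the new tail, or overlapping the windows, are common workarounds.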
After these changes you can get the first audio chunk in 300-500 ms, which is plenty for most uses.
With acceleration, around 150 ms on a 5090.
For real?? Did you export to TorchScript and run inference with libtorch? Does the AR decode really fly like that? 😚 Just generating the audio takes me nearly 150 ms.
It's on my profile page.
hey guys! I know this issue is about something else, but has anyone trained the model for other languages? Is it possible to train it for Spanish, French, etc.? I'd really appreciate a response. Regards
GitHub Wikis
hey @XXXXRT666, sorry, I don't understand. What do you mean by GitHub Wikis? Is there anything out there on training this model, or have other people already done it? Thank you!
GitHub wiki of this repo
> For real?? Did you export to TorchScript and run inference with libtorch? Does the AR decode really fly like that? 😚 Just generating the audio takes me nearly 150 ms.
https://github.com/XXXXRT666/GPT-SoVITS/tree/Accel#infer-speed