GPT-SoVITS
The model structure of the TTS part
I noticed that the author shared an explanation video, but it covers the principles of the voice-cloning part. Will there be an explanation and structure diagram of the TTS part in the future? Thank you very much!
train:
- preprocess_stage1: wav -> hubert, text -> bert
- stage1: hubert -> token --(+ text + reference_encoder_embedding)--> wav (sovits)
- preprocess_stage2: hubert -> token
- stage2: tokens + bert + text -> tokens (gpt; more accurately, it is the SoundStorm AR stage)
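A minimal Python sketch of this training data flow (fine-tuning below follows the same shape, with stage 1 going through the sovits_decoder). Every function name here is a hypothetical placeholder illustrating the flow, not the real GPT-SoVITS module or API:

```python
# Hypothetical placeholders standing in for the real models; they only
# illustrate which data feeds which stage, not the actual computation.

def hubert_encode(wav):
    """wav -> self-supervised speech features (HuBERT)."""
    return ["hubert_feat"]

def bert_encode(text):
    """text -> contextual text features (BERT)."""
    return ["bert_feat"]

def quantize(hubert_feats):
    """HuBERT features -> discrete semantic tokens."""
    return ["tok1", "tok2"]

def sovits(tokens, text, ref_embedding):
    """Stage 1: tokens (+ text + reference-encoder embedding) -> waveform."""
    return b"reconstructed_wav"

def gpt_ar(tokens, bert_feats, text):
    """Stage 2: autoregressive token prediction (tokens + bert + text -> tokens)."""
    return tokens + ["next_tok"]

# preprocess_stage1: wav -> hubert, text -> bert
wav, text = b"raw_wav", "hello world"
hubert_feats = hubert_encode(wav)
bert_feats = bert_encode(text)

# preprocess_stage2 / stage1 (SoVITS): the reference_encoder_embedding
# carries the speaker timbre
tokens = quantize(hubert_feats)
wav_out = sovits(tokens, text, ref_embedding="speaker_emb")

# stage2 (gpt, i.e. the SoundStorm-style AR stage)
predicted_tokens = gpt_ar(tokens, bert_feats, text)
```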
fine-tune:
- preprocess_stage: wav -> hubert -> token, text -> bert
- stage1: token --(+ text + reference_encoder_embedding)--> wav (sovits_decoder)
- stage2: tokens + bert + text -> tokens (gpt)
inference:
- text -> bert
- prompt_wav -> prompt_token (sovits_encoder)
- prompt_token + todo_text + todo_bert -> completed tokens (gpt)
- completed tokens + todo_text + reference_encoder_embedding -> output vocal (sovits_decoder)
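And a matching sketch of the inference flow. Again, all function names are hypothetical placeholders rather than the real GPT-SoVITS entry points:

```python
# Hypothetical placeholders; only the data flow matches the description above.

def bert_encode(text):
    """text -> bert features."""
    return ["bert_feat"]

def sovits_encoder(prompt_wav):
    """Reference audio -> prompt tokens."""
    return ["prompt_tok"]

def gpt_complete(prompt_tokens, todo_text, todo_bert):
    """AR model continues the prompt tokens to cover the text to synthesize."""
    return prompt_tokens + ["new_tok_1", "new_tok_2"]

def sovits_decoder(tokens, todo_text, ref_embedding):
    """Completed tokens (+ text + timbre embedding) -> output audio."""
    return b"output_wav"

todo_text = "text to synthesize"
prompt_wav = b"reference_audio"

todo_bert = bert_encode(todo_text)                 # text -> bert
prompt_tokens = sovits_encoder(prompt_wav)         # prompt_wav -> prompt_token
completed = gpt_complete(prompt_tokens, todo_text, todo_bert)
vocal = sovits_decoder(completed, todo_text, ref_embedding="speaker_emb")
```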
thanks ~!!
I have no background in this area at all. What should I learn, step by step, to be able to understand this explanation?
Marking this. Thanks.
Is there anywhere I can find more detailed diagrams of the model's training and inference structure? Or could you share one? The text description in comment #2 alone is a bit hard to follow.
I found an inference structure diagram myself at https://zhuanlan.zhihu.com/p/684120282 . Is it correct?