GPT-SoVITS

The model structure of TTS

Open WinterStraw opened this issue 1 year ago • 1 comments

I noticed that the author shared an explanation video, but it covers the principles of the voice-cloning part. Will there be a similar explanation and structure diagram for the TTS part in the future? Thank you very much!

WinterStraw, Jan 19 '24 00:01

train:
preprocess_stage1: wav->hubert, text->bert
stage1: hubert->token----(+text+reference_encoder_embedding)---->wav (sovits)
preprocess_stage2: hubert->token
stage2: tokens+bert+text->tokens (gpt; more accurately, it is SoundStorm stage_AR)

fine tune:
preprocess_stage: wav->hubert->token, text->bert
stage1: token------(+text+reference_encoder_embedding)----->wav (sovits_decoder)
stage2: tokens+bert+text->tokens (gpt)

inference:
text->bert
prompt_wav->prompt_token (sovits_encoder)
prompt_token+todo_text+todo_bert->completed token (gpt)
completed token+todo_text+reference_encoder_embedding->output vocal (sovits_decoder)
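
The inference flow above can be read as four steps, each consuming values produced earlier. The following is a minimal sketch of that data flow only, not GPT-SoVITS code: the `Step` class and the `trace` helper are hypothetical names introduced for illustration, and the component labels are copied from the description. Two assumptions are made: that "text->bert" produces the todo_bert used later, and that reference_encoder_embedding is supplied from outside, since the description does not say where it is computed at inference time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One arrow in the description: inputs -> outputs, handled by a component."""
    component: str
    inputs: tuple
    outputs: tuple


# Inference steps transcribed from the description above (illustrative only).
INFERENCE = [
    # Assumption: "text->bert" yields the todo_bert consumed by the gpt step.
    Step("bert", ("todo_text",), ("todo_bert",)),
    Step("sovits_encoder", ("prompt_wav",), ("prompt_token",)),
    Step("gpt", ("prompt_token", "todo_text", "todo_bert"), ("completed_token",)),
    Step(
        "sovits_decoder",
        ("completed_token", "todo_text", "reference_encoder_embedding"),
        ("output_vocal",),
    ),
]


def trace(steps, available):
    """Walk the steps in order and fail if a step needs a value nobody produced."""
    have = set(available)
    order = []
    for step in steps:
        missing = [name for name in step.inputs if name not in have]
        if missing:
            raise ValueError(f"{step.component} is missing inputs: {missing}")
        have.update(step.outputs)
        order.append(step.component)
    return order, have


# Assumption: these three values are what the caller provides at inference time.
order, values = trace(
    INFERENCE, {"todo_text", "prompt_wav", "reference_encoder_embedding"}
)
print(" -> ".join(order))
```

Running `trace` with an incomplete set of starting values (for example, without `prompt_wav`) raises a `ValueError` naming the step that cannot run, which makes the dependency between the prompt audio and the gpt step explicit.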

RVC-Boss, Jan 19 '24 02:01

Thanks!

zhuangzhuangliu2345, Jan 29 '24 02:01

I have no background in this area at all. What should I learn, step by step, to be able to understand your explanation?

45xjh, Mar 12 '24 10:03

Mark, thanks.

ZhangJianBeiJing, Apr 16 '24 07:04

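Is there anywhere I can see a more detailed structure diagram of the model's training and inference? Or would it be possible for you to provide one? The text description in the second post alone is not very clear.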

I found this inference structure diagram myself: https://zhuanlan.zhihu.com/p/684120282. Is it correct?

SummerXIATIAN, May 24 '24 15:05