GPT-SoVITS
The model structure of the TTS part
I noticed that the author shared an explanation video, but it covers the principles of the voice-cloning part. Will there be an explanation and structure diagram of the TTS part in the future? Thank you very much!
train:
- preprocess_stage1: wav -> hubert, text -> bert
- stage1: hubert -> token --(+ text + reference_encoder_embedding)--> wav (sovits)
- preprocess_stage2: hubert -> token
- stage2: tokens + bert + text -> tokens (gpt; more accurately, it is the SoundStorm AR stage)
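A minimal Python sketch of this training data flow (fine-tuning below follows the same shape, with stage 1 going through the sovits_decoder). Every function name here is a hypothetical placeholder illustrating the flow, not the real GPT-SoVITS module or API:

```python
# Hypothetical placeholders standing in for the real models; they only
# illustrate which data feeds which stage, not the actual computation.

def hubert_encode(wav):
    """wav -> self-supervised speech features (HuBERT)."""
    return ["hubert_feat"]

def bert_encode(text):
    """text -> contextual text features (BERT)."""
    return ["bert_feat"]

def quantize(hubert_feats):
    """HuBERT features -> discrete semantic tokens."""
    return ["tok1", "tok2"]

def sovits(tokens, text, ref_embedding):
    """Stage 1: tokens (+ text + reference-encoder embedding) -> waveform."""
    return b"reconstructed_wav"

def gpt_ar(tokens, bert_feats, text):
    """Stage 2: autoregressive token prediction (tokens + bert + text -> tokens)."""
    return tokens + ["next_tok"]

# preprocess_stage1: wav -> hubert, text -> bert
wav, text = b"raw_wav", "hello world"
hubert_feats = hubert_encode(wav)
bert_feats = bert_encode(text)

# preprocess_stage2 / stage1 (SoVITS): the reference_encoder_embedding
# carries the speaker timbre
tokens = quantize(hubert_feats)
wav_out = sovits(tokens, text, ref_embedding="speaker_emb")

# stage2 (gpt, i.e. the SoundStorm-style AR stage)
predicted_tokens = gpt_ar(tokens, bert_feats, text)
```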
fine-tune:
- preprocess_stage: wav -> hubert -> token, text -> bert
- stage1: token --(+ text + reference_encoder_embedding)--> wav (sovits_decoder)
- stage2: tokens + bert + text -> tokens (gpt)
inference:
- text -> bert
- prompt_wav -> prompt_token (sovits_encoder)
- prompt_token + todo_text + todo_bert -> completed tokens (gpt)
- completed tokens + todo_text + reference_encoder_embedding -> output vocal (sovits_decoder)
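And a matching sketch of the inference flow. Again, all function names are hypothetical placeholders rather than the real GPT-SoVITS entry points:

```python
# Hypothetical placeholders; only the data flow matches the description above.

def bert_encode(text):
    """text -> bert features."""
    return ["bert_feat"]

def sovits_encoder(prompt_wav):
    """Reference audio -> prompt tokens."""
    return ["prompt_tok"]

def gpt_complete(prompt_tokens, todo_text, todo_bert):
    """AR model continues the prompt tokens to cover the text to synthesize."""
    return prompt_tokens + ["new_tok_1", "new_tok_2"]

def sovits_decoder(tokens, todo_text, ref_embedding):
    """Completed tokens (+ text + timbre embedding) -> output audio."""
    return b"output_wav"

todo_text = "text to synthesize"
prompt_wav = b"reference_audio"

todo_bert = bert_encode(todo_text)                 # text -> bert
prompt_tokens = sovits_encoder(prompt_wav)         # prompt_wav -> prompt_token
completed = gpt_complete(prompt_tokens, todo_text, todo_bert)
vocal = sovits_decoder(completed, todo_text, ref_embedding="speaker_emb")
```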
thanks ~!!
I have no background in this area at all. What should I learn, step by step, to be able to understand this explanation?
Marking this. Thanks.
Is there anywhere I can find more detailed diagrams of the model's training and inference structure? Or could you share one? The text description in comment #2 alone is a bit hard to follow.
I found an inference structure diagram myself at https://zhuanlan.zhihu.com/p/684120282 . Is it correct?