GPT-SoVITS
Cantonese Inference Support?
I noticed that the project provides an extensive set of functionality for voice conversion and text-to-speech, and I am specifically interested in using it for Cantonese text and speech processing. Does GPT-SoVITS support Cantonese inference?
If GPT-SoVITS does not currently support Cantonese, I would greatly appreciate any guidance or insights on how to make it work effectively with Cantonese. Specifically, I would like to know if there are any available pre-trained models or if it's possible to fine-tune the existing models on Cantonese datasets? Many thanks!!
I will look into Cantonese datasets and a Cantonese text frontend in the future.
For now, the pretrained model was not trained on any Cantonese data.
Thank you for your response. I'm glad to hear that you are open to exploring Cantonese datasets and a text frontend in the future! There should be some open-source Cantonese datasets that could be used for training or fine-tuning the GPT-SoVITS models. Additionally, there are pre-trained BERT models specifically trained on Cantonese, which might be valuable for adding Cantonese support as well. Please let me know if you would like more information or if there's any way I can assist in the process. I'm excited about the possibility of Cantonese inference support in the project and would be happy to contribute.
If convenient, could you please let me know which Cantonese datasets are available for training now and which repositories include the Cantonese text frontend? I'm conducting some research.
Hi, here is an open-source Cantonese dataset: https://magichub.com/datasets/guangzhou-cantonese-scripted-speech-corpus-daily-use-sentence/ For the BERT model, cantoformer may help: https://github.com/paramiai/cantoformer
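I can help with the Cantonese side of this too. I'm really looking forward to the day my favorite characters can learn to speak my hometown language~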
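Could the maintainer share how to create a Cantonese model that works with GPT-SoVITS? I believe many people here could help. We also have a fair amount of data on our side, but we don't know how to get started...
It seems we would need to train a base model as well as a GPT model. I'll take a look over the weekend.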
Hi @CloudTronUSA, I stumbled upon an open-source base model at https://huggingface.co/xiaomaiiwn/vits-cantonese/tree/main/model, which comes with G.pth and D.pth files. Do you think it's possible to use these as the pretrained SoVITS-G and SoVITS-D model paths? Crafting a GPT model, however, is somewhat outside of my wheelhouse. On another note, I have successfully set up a fine-tuned Whisper Cantonese model to serve as ASR for this project, and I'd be thrilled to share my experience if you're interested. I hope that someday this project can fully support Cantonese.
I used SoVITS to convert the voices in the Mozilla Common Voice dataset and trained a VITS model on the result. Would this help?
RVC might be better in this case, but RVC has no pitch prediction, so I cannot really do batch processing with it...
https://github.com/wenet-e2e/wetts/assets/52615455/4a851a3d-aa5a-4ae9-9aca-67659267beb6
Hi @Naozumi520, did you fine-tune a base model or train from scratch?
I think the main focus should be making a GPT model that supports Cantonese generation. SoVITS itself should have no problem adapting to Cantonese; there are a lot of models available already, and we just need to retrain it with Cantonese data added in.
We might have to retrain our own D and G models from scratch.
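@RVC-Boss Could you explain in detail how the GPT model and the SoVITS model work together during training, and how the data is processed? I'd like to see whether I can look into Cantonese synthesis. A new character in a certain game reads Cantonese lines in Mandarin, which is really awkward... I have to replace the voice lines or it will keep bothering me.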
I trained the model from scratch.
Well, I have trained some Cantonese speaker models with So-VITS-SVC/RVC for voice cloning before, and they work quite well, but I am not sure whether they could be reused for TTS. As for the GPT model, I suppose it needs BERT + wav2vec preprocessing and grapheme-to-phoneme conversion? I hope RVC-Boss can give some insight here, thanks...
@drymass2023 I found this by RVC-BOSS from another issue, might help
train:
  preprocess_stage1: wav->hubert, text->bert
  stage1: hubert->token----(+text+reference_encoder_embedding)---->wav (sovits)
  preprocess_stage2: hubert->token
  stage2: tokens+bert+text->tokens (gpt; more accurately, it is Soundstorm stage_AR)

fine tune:
  preprocess_stage: wav->hubert->token, text->bert
  stage1: token------(+text+reference_encoder_embedding)----->wav (sovits_decoder)
  stage2: tokens+bert+text->tokens (gpt)

inference:
  text->bert
  prompt_wav->prompt_token (sovits_encoder)
  prompt_token+todo_text+todo_bert->completed token (gpt)
  completed token+todo_text+reference_encoder_embedding->output vocal (sovits_decoder)
As well as this graph for AR (the GPT model)
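To make the inference flow quoted above easier to follow, here is a minimal Python sketch of the same four steps. Every function name here (`text_frontend`, `sovits_encode`, `gpt_complete`, `reference_embedding`, `sovits_decode`, `infer`) is a hypothetical stand-in that returns dummy data; none of them is a GPT-SoVITS API. It only shows which outputs feed which inputs.

```python
from dataclasses import dataclass
from typing import List


# All functions below are hypothetical stand-ins, NOT GPT-SoVITS APIs.
# They return dummy data so that the data flow can be run end to end.

@dataclass
class TextFeatures:
    phones: List[str]        # output of the text frontend (g2p)
    bert: List[List[float]]  # one feature vector per phone


def text_frontend(text: str) -> TextFeatures:
    """text -> phones + bert (dummy: one phone per character, zero features)."""
    phones = list(text.replace(" ", ""))
    return TextFeatures(phones, [[0.0] * 4 for _ in phones])


def sovits_encode(prompt_wav: List[float]) -> List[int]:
    """prompt_wav -> prompt_token (dummy: one token per 4 samples)."""
    return [i % 8 for i in range(len(prompt_wav) // 4)]


def gpt_complete(prompt_tokens: List[int], target: TextFeatures) -> List[int]:
    """prompt_token + todo_text + todo_bert -> completed token (dummy: 2 per phone)."""
    return [(len(prompt_tokens) + i) % 8 for i in range(2 * len(target.phones))]


def reference_embedding(prompt_wav: List[float]) -> List[float]:
    """prompt_wav -> reference_encoder_embedding (dummy: mean amplitude)."""
    return [sum(prompt_wav) / max(len(prompt_wav), 1)]


def sovits_decode(tokens: List[int], target: TextFeatures,
                  ref_emb: List[float]) -> List[float]:
    """completed token + todo_text + embedding -> output vocal (dummy: 3 samples per token)."""
    return [ref_emb[0]] * (3 * len(tokens))


def infer(prompt_wav: List[float], todo_text: str) -> List[float]:
    target = text_frontend(todo_text)               # text -> bert
    prompt_tokens = sovits_encode(prompt_wav)       # sovits_encoder
    completed = gpt_complete(prompt_tokens, target) # gpt
    ref_emb = reference_embedding(prompt_wav)
    return sovits_decode(completed, target, ref_emb)  # sovits_decoder


if __name__ == "__main__":
    wav = infer([0.5] * 16, "nei hou")
    print(len(wav))
```

Under this reading, the Cantonese-specific work sits mostly in `text_frontend` (the g2p and optional BERT features) and in the GPT stage that consumes its output, while the encoder and decoder steps operate on audio tokens.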
Once training is done, please tidy it up and release it!
Yes, if I am not wrong, the directory GPT-SoVITS\GPT_SoVITS\prepare_datasets contains three scripts: 1-get-text.py does the text cleaning and uses a BERT model to extract text features (I am trying to plug in a Cantonese text cleaner and a Cantonese BERT here), 2-get-hubert-wav32k.py extracts audio features using HuBERT (I am trying to use a Cantonese Wav2Vec to obtain similar features), and 3.semantic.py should map the features obtained in steps 1 and 2 to the pretrained model...
I am also interested in your Cantonese project and happy to contribute.
I suggest we create either a Telegram or Discord group for brainstorming ideas, sharing resources, and coordinating efforts to contribute to this endeavor... below is the TG link for discussion: https://t.me/+EcFI6Mxolos4ZGZl
@drymass2023 Here's a Discord server I created: https://discord.gg/45YufbXJ PS: for some reason your TG channel seems to be read-only.
You later found that other issue yourself; it works exactly the way that issue describes.
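@drymass2023 The BERT model is optional: either set its features to zero (as is done for English and Japanese) or choose a Cantonese BERT model. I don't know whether performance would improve with the latter, so setting zero is the safer choice. As for the HuBERT model, the default cn_hubert is fine.
But the most necessary things are a Cantonese g2p (text cleaner) and a Cantonese speech dataset (at this stage, the more hours the better; going by its description, the one on magichub has only 4 hours, and more should work better). I found one text cleaner, but I don't know whether we can use it successfully: https://github.com/CjangCjengh/vits/blob/main/text/cantonese.py
What I would recommend even more, though, is that everyone here who knows Cantonese better contributes datasets with high text quality and long duration, along with a text frontend. I will only handle the training, since I certainly don't know Cantonese better than the people in this thread.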
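As a rough illustration of the "set zero" option, here is a minimal sketch in plain Python. The function name `text_features`, the 1024-wide feature size, and the one-vector-per-phone layout are all assumptions made for the example; the real project works on tensors and may lay them out differently.

```python
from typing import Callable, List, Optional

# Assumption: width of the BERT features the model expects.
BERT_DIM = 1024

Features = List[List[float]]


def text_features(
    phones: List[str],
    bert_fn: Optional[Callable[[List[str]], Features]] = None,
    dim: int = BERT_DIM,
) -> Features:
    """Return one feature vector per phone.

    With no BERT model for the language (bert_fn is None), fall back to
    all-zero vectors, i.e. the "set zero" option. Otherwise call the
    supplied (hypothetical) BERT function and check the shape it returns.
    """
    if bert_fn is None:
        return [[0.0] * dim for _ in phones]
    feats = bert_fn(phones)
    if len(feats) != len(phones) or any(len(v) != dim for v in feats):
        raise ValueError("BERT features must be len(phones) x dim")
    return feats


if __name__ == "__main__":
    feats = text_features(["gw", "ong2", "d", "ung1"])
    print(len(feats), len(feats[0]))
```

Keeping the zero fallback as the default means a Cantonese BERT can be swapped in later through `bert_fn` without touching the rest of the pipeline.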
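Regarding the Cantonese portion of the Common Voice dataset mentioned above: how is the text quality? Has anyone looked into it?
@RVC-Boss If you have Discord, you can join this server: https://discord.gg/mSpHcank It has a dedicated channel for discussing Cantonese support.
I found a package, pycantonese, that can convert Cantonese into a consonant and vowel sound + tone format.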
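For reference, the consonant + vowel sound + tone format mentioned above can be sketched with the standard library alone. The example below is a hand-rolled splitter for Jyutping romanization, not pycantonese's implementation; the names `parse_syllable` and `jyutping_to_phones`, and the choice to attach the tone to the final, are assumptions for illustration. Converting Chinese characters to Jyutping in the first place would still need a dictionary-backed tool such as pycantonese.

```python
import re
from typing import List, NamedTuple


class Syllable(NamedTuple):
    onset: str    # initial consonant, "" if absent
    nucleus: str  # vowel (or syllabic nasal)
    coda: str     # final consonant or glide, "" if absent
    tone: str     # "1" to "6"


# Regular Jyutping syllable: optional onset, vowel nucleus, optional coda, tone.
_SYLLABLE = re.compile(
    r"^(ng|gw|kw|[bpmfdtnlgkhwzcsj])?"
    r"(aa|oe|eo|yu|[aeiou])"
    r"(ng|[ptkmniu])?"
    r"([1-6])$"
)
# Syllabic nasals such as "m4" and "ng5" have no vowel.
_SYLLABIC_NASAL = re.compile(r"^(h)?(m|ng)([1-6])$")


def parse_syllable(s: str) -> Syllable:
    """Split one Jyutping syllable into onset, nucleus, coda and tone."""
    m = _SYLLABLE.match(s)
    if m:
        onset, nucleus, coda, tone = m.groups()
        return Syllable(onset or "", nucleus, coda or "", tone)
    m = _SYLLABIC_NASAL.match(s)
    if m:
        onset, nucleus, tone = m.groups()
        return Syllable(onset or "", nucleus, "", tone)
    raise ValueError(f"not a Jyutping syllable: {s!r}")


def jyutping_to_phones(jyutping: str) -> List[str]:
    """Turn a Jyutping string into [onset, final+tone, ...] phone symbols."""
    phones: List[str] = []
    for s in re.findall(r"[a-z]+[1-6]", jyutping.lower()):
        syl = parse_syllable(s)
        if syl.onset:
            phones.append(syl.onset)
        phones.append(syl.nucleus + syl.coda + syl.tone)
    return phones


if __name__ == "__main__":
    print(jyutping_to_phones("gwong2 dung1 waa2"))
```

Whether tones should be attached to the final, as here, or emitted as separate symbols is a design choice that has to match whatever symbol table the model is trained with.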
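The text is 100% accurate, because it is not transcribed from audio: text contributors write the sentences first, and speakers then read them aloud. The version used is also the one that contributors have verified by hand.
https://commonvoice.mozilla.org/zh-HK/datasets
The most recent release has 108 hours of validated data.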