GPT-SoVITS Cantonese Inference Support?

I noticed that the project provides an extensive set of features for voice conversion and text-to-speech, and I am specifically interested in using it for Cantonese text and speech processing. Does GPT-SoVITS support Cantonese inference?

If GPT-SoVITS does not currently support Cantonese, I would greatly appreciate any guidance or insights on how to make it work effectively with Cantonese. Specifically, I would like to know if there are any available pre-trained models or if it's possible to fine-tune the existing models on Cantonese datasets? Many thanks!!

Jan 25 '24 03:01 drymass2023

I will conduct research on Cantonese datasets and text frontend in the future.

Jan 25 '24 03:01 RVC-Boss

The current pretrained model was not trained on any Cantonese data.

Jan 25 '24 03:01 RVC-Boss

Thank you for your response. I'm glad to hear that you are open to exploring Cantonese datasets and a text frontend in the future! There should be some open-source Cantonese datasets that could potentially be used to train or fine-tune the GPT-SoVITS models. There are also pre-trained BERT models trained specifically on Cantonese, which might be valuable for adding Cantonese support. Please let me know if you would like more information or if there's any way I can assist. I'm excited about the possibility of Cantonese inference support in the project and would be happy to contribute.

Jan 25 '24 03:01 drymass2023

If convenient, could you please let me know which Cantonese datasets are available for training now and which repositories include the Cantonese text frontend? I'm conducting some research.

Jan 25 '24 04:01 RVC-Boss

Hi, here is an open-source Cantonese dataset: https://magichub.com/datasets/guangzhou-cantonese-scripted-speech-corpus-daily-use-sentence/ For the BERT model, cantoformer may help: https://github.com/paramiai/cantoformer

Jan 25 '24 04:01 drymass2023

I can help with the Cantonese work too. I'm really looking forward to the day my favorite characters can speak my hometown language~

Jan 30 '24 01:01 CloudTronUSA

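Could the maintainer share how to create a Cantonese model that works with GPT-SoVITS? I believe many people here could help. We also have a fair amount of data on our side, but we don't know where to start....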

Jan 30 '24 16:01 drymasscom

It seems we need to train a base model as well as a GPT model. I'll take a look over the weekend.

Jan 30 '24 18:01 CloudTronUSA

Hi @CloudTronUSA, I stumbled upon an open-source base model at https://huggingface.co/xiaomaiiwn/vits-cantonese/tree/main/model, which comes with G.pth and D.pth files. Do you think it's possible to use these as the pretrained SoVITS-G and SoVITS-D model paths? Crafting a GPT model, however, is somewhat outside my wheelhouse. On another note, I have successfully set up a fine-tuned Whisper Cantonese model to serve as the ASR for this project, and I'd be thrilled to share my experience if you're interested. I hope that someday this project can really support Cantonese.

Feb 01 '24 08:02 drymass2023

I used sovits to convert the voices in the Mozilla Common Voice dataset, and trained a VITS model on the result. Will this help?

RVC might be better in this case, but RVC has no pitch prediction, so I can't really do batch processing with it...

https://github.com/wenet-e2e/wetts/assets/52615455/4a851a3d-aa5a-4ae9-9aca-67659267beb6

Feb 04 '24 06:02 Naozumi520

Hi @Naozumi520, did you fine-tune a base model or train from scratch?

Feb 05 '24 03:02 drymass2023

I think the main focus should be building a GPT model that supports Cantonese generation. SoVITS itself should have no problem adapting to Cantonese. There are a lot of models available already; we just need to retrain it with Cantonese data added in.

Feb 06 '24 00:02 CloudTronUSA

We might have to train our own D and G models from scratch.

Feb 06 '24 00:02 CloudTronUSA

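@RVC-Boss Could you explain in detail how the GPT model and the SoVITS model work together during training, and how the data is processed? I'd like to see whether I can work on Cantonese synthesis. A new character in a certain game reads Cantonese lines in Mandarin, which is really awkward... I have to replace the voice, or it will drive me crazy.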

Feb 06 '24 00:02 CloudTronUSA

I trained the model from scratch.

Feb 06 '24 04:02 Naozumi520

Well, I have trained some Cantonese speaker models with so-vits-svc/RVC for voice cloning before, and they worked quite well, but I'm not sure whether they could be reused for TTS. As for the GPT model, I suppose it needs BERT + wav2vec preprocessing and grapheme-to-phoneme conversion? I hope RVC-Boss can give some insight here, thanks.

Feb 06 '24 07:02 drymass2023

@drymass2023 I found this by RVC-BOSS from another issue, might help

train:
preprocess_stage1: wav->hubert, text->bert
stage1: hubert->token----(+text+reference_encoder_embedding)---->wav (sovits)
preprocess_stage2: hubert->token
stage2: tokens+bert+text->tokens (gpt (More accurately, it is Soundstorm stage_AR.))

fine tune:
preprocess_stage: wav->hubert->token, text->bert
stage1: token------(+text+reference_encoder_embedding)----->wav (sovits_decoder)
stage2: tokens+bert+text->tokens (gpt)

inference:
text->bert
prompt_wav->prompt_token (sovits_encoder)
prompt_token+todo_text+todo_bert->completed token (gpt)
completed token+todo_text+reference_encoder_embedding->output vocal (sovits_decoder)

As well as this graph for AR (the GPT model)
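
To make the ordering of the four inference steps listed above easier to follow, here is a minimal sketch that chains them as plain Python functions. Every function name, data type, and number below is a hypothetical stand-in chosen for illustration; none of it is GPT-SoVITS code, and the stubs only mimic the data flow, not the models.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextFeatures:
    phones: List[str]        # "todo_text" after a text frontend (g2p)
    bert: List[List[float]]  # "todo_bert": one toy feature vector per phone


def text_frontend(text: str) -> TextFeatures:
    """Stub for text->bert. Assumes the text is already space-separated phones."""
    phones = text.split()
    bert = [[0.0] * 4 for _ in phones]  # toy 4-dim vectors, not real BERT output
    return TextFeatures(phones, bert)


def sovits_encode(prompt_wav: List[float]) -> List[int]:
    """Stub for prompt_wav->prompt_token (sovits_encoder)."""
    return [1 if sample >= 0 else 0 for sample in prompt_wav]


def gpt_complete(prompt_tokens: List[int], feats: TextFeatures) -> List[int]:
    """Stub for prompt_token+todo_text+todo_bert->completed token (gpt)."""
    # A real model predicts tokens autoregressively; here one fake token per phone.
    return prompt_tokens + [len(p) % 8 for p in feats.phones]


def sovits_decode(tokens: List[int], feats: TextFeatures,
                  reference_embedding: List[float]) -> List[float]:
    """Stub for completed token+todo_text+reference_encoder_embedding->output vocal."""
    gain = sum(reference_embedding) / len(reference_embedding)
    return [token * gain for token in tokens]


def infer(text: str, prompt_wav: List[float],
          reference_embedding: List[float]) -> List[float]:
    feats = text_frontend(text)                              # text->bert
    prompt_tokens = sovits_encode(prompt_wav)                # prompt_wav->prompt_token
    tokens = gpt_complete(prompt_tokens, feats)              # gpt
    return sovits_decode(tokens, feats, reference_embedding) # sovits_decoder


if __name__ == "__main__":
    print(infer("nei5 hou2", [0.1, -0.2], [1.0, 1.0]))
```

Read this way, Cantonese support mostly touches the first step (the text frontend) and the data the GPT stage is trained on, which matches the direction the thread is heading.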

Feb 06 '24 14:02 CloudTronUSA

Once it's trained, please tidy it up and share it!

Feb 06 '24 15:02 Stanley-baby

Yes, if I'm not mistaken, the directory GPT-SoVITS\GPT_SoVITS\prepare_datasets contains three scripts. 1-get-text.py cleans the text and extracts text features with a BERT model (I am trying to swap in a Cantonese text cleaner and a Cantonese BERT here). 2-get-hubert-wav32k.py extracts audio features with HuBERT (I am trying to use a Cantonese wav2vec model to obtain similar features). 3-get-semantic.py should map the features obtained in steps 1 and 2 onto the pretrained model....

Feb 06 '24 16:02 drymass2023

I'm also interested in your Cantonese project and happy to contribute.

Feb 08 '24 06:02 freeman6789

I suggest we create either a Telegram or Discord group for brainstorming ideas, sharing resources, and coordinating efforts to contribute to this endeavor... below is the TG link for discussion: https://t.me/+EcFI6Mxolos4ZGZl

Feb 10 '24 14:02 drymass2023

@drymass2023 Here's a Discord server I created: https://discord.gg/45YufbXJ PS: for some reason your TG channel seems to be read-only.

Feb 10 '24 18:02 CloudTronUSA

You found the other issue further down the thread; it works exactly the way that issue describes.

Feb 13 '24 14:02 RVC-Boss

@drymass2023 The BERT model is optional: either set the BERT features to zero (as is done for English and Japanese) or use a Cantonese BERT model. I don't know whether a Cantonese BERT would improve performance; setting the features to zero is the safer choice. As for the HuBERT model, the default cn_hubert is fine.

But the most necessary things are a Cantonese g2p (text cleaner) and a Cantonese speech dataset (at this stage, the more hours the better; judging from its description, the one on MagicHub is only 4 hours, and more should work better). I found one text cleaner, but I don't know whether we can use it successfully: https://github.com/CjangCjengh/vits/blob/main/text/cantonese.py
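
As an illustration of the "set zero" option mentioned above, the sketch below builds an all-zero feature matrix with one column per phone. The 1024 feature width, the (dim x phones) layout, and the function name are assumptions made for this example; check what your checkpoint and code path actually expect before relying on them.

```python
from typing import List

BERT_DIM = 1024  # assumed feature width, not taken from this thread


def zero_bert_features(phones: List[str], dim: int = BERT_DIM) -> List[List[float]]:
    """Return a dim x len(phones) all-zero matrix as a stand-in for BERT features."""
    return [[0.0] * len(phones) for _ in range(dim)]


if __name__ == "__main__":
    feats = zero_bert_features(["n", "ei5", "h", "ou2"])
    print(len(feats), len(feats[0]))
```

With zeros in place of real BERT output, the model gets no extra linguistic signal from the text, so the quality of the phone sequence produced by the g2p matters even more.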

Feb 13 '24 14:02 RVC-Boss

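That said, what I'd recommend even more is that everyone here who knows Cantonese better contributes datasets with high text quality and long duration, along with a text frontend, and I just handle the training, because I certainly don't know Cantonese better than the people in this thread.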

Feb 13 '24 14:02 RVC-Boss

How is the text quality of the Cantonese portion of the Common Voice dataset mentioned above? Has anyone looked into it?

Feb 13 '24 14:02 RVC-Boss

@RVC-Boss If you have Discord, you can join this server: https://discord.gg/mSpHcank It has a dedicated channel for discussing Cantonese support.

Feb 13 '24 16:02 CloudTronUSA

I found a package, pycantonese, that can convert Cantonese into a consonant + vowel sound + tone format.
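
To show what a "consonant + vowel sound + tone" breakdown could look like, here is a small standard-library sketch that splits Jyutping syllables into an initial, a final, and a tone. It assumes the Jyutping strings are already available (for instance from a package such as pycantonese, which is not imported here). The function names and the choice to emit the final and tone as one token are illustrative assumptions, not something the thread or GPT-SoVITS prescribes.

```python
import re
from typing import List, Tuple

# Jyutping initials. Multi-letter initials come first so that "ng", "gw" and "kw"
# are preferred over "n", "g" and "k".
_SYLLABLE = re.compile(r"^(ng|gw|kw|[bpmfdtnlgkhwzcsj])?([a-z]*)([1-6])$")


def split_jyutping(syllable: str) -> Tuple[str, str, str]:
    """Split one Jyutping syllable such as 'gwong2' into (initial, final, tone)."""
    match = _SYLLABLE.match(syllable.lower())
    if match is None:
        raise ValueError(f"not a Jyutping syllable: {syllable!r}")
    initial = match.group(1) or ""
    final = match.group(2)
    tone = match.group(3)
    if not final:
        # Syllabic nasals ("m4", "ng5") have no vowel: the nasal itself is the final.
        initial, final = "", initial
    if not final:
        raise ValueError(f"not a Jyutping syllable: {syllable!r}")
    return initial, final, tone


def to_phonemes(jyutping: str) -> List[str]:
    """Turn spaced or concatenated Jyutping into [initial, final+tone, ...] tokens."""
    phones: List[str] = []
    for syllable in re.findall(r"[a-z]+[1-6]", jyutping.lower()):
        initial, final, tone = split_jyutping(syllable)
        if initial:
            phones.append(initial)
        phones.append(final + tone)
    return phones


if __name__ == "__main__":
    print(to_phonemes("gwong2 dung1 waa2"))
```

Keeping the tone attached to the final is one possible design; emitting the tone as its own token is an equally plausible alternative, and which one trains better would need to be tested.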

Feb 13 '24 16:02 CloudTronUSA

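The text is 100% accurate, because it is not transcribed from audio: contributors submit the text first, and speakers then read it aloud. Also, the version I used has been manually verified by contributors.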

Feb 14 '24 04:02 Naozumi520

https://commonvoice.mozilla.org/zh-HK/datasets

The most recent release has 108 hours of validated data.

Feb 14 '24 04:02 Naozumi520
