Bert-VITS2
Refactor repo, support modular features
In view of the chaotic version management of Bert-VITS2, we plan to refactor Bert-VITS2 in 2023Q4-2024Q1 and modularize each feature. At the same time, pure Chinese generation will be improved so that the refactored release achieves the best possible results.
Asking for help. First, training is not stable when net_dur_disc is enabled; how did you train the base model? Second, why use roberta-large instead of smaller and faster BERT-like models, and why use the hidden states of the -2 and -3 layers instead of the final layer? Third, in my runs I train from scratch, but adding the BERT embedding to the phoneme embedding in the TextEncoder eventually leads to wrongly uttered words. How can I solve this problem? By adding more training data?
For the first: we add net_dur_disc only for the last 50k/30k steps. Second, we chose the larger BERT model to get better rhythm, and we use the -2/-3 layers because "The last layer is too closed to the target functions (i.e. masked language model and next sentence prediction) during pre-training, therefore may be biased to those targets." (https://bert-as-service.readthedocs.io/en/latest/section/faq.html#why-not-the-last-hidden-layer-why-second-to-last) For the third, I am not sure which language you used and how much data you had; in my experience I have never encountered this problem.
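For reference, here is a minimal sketch (not the repo's exact code) of pulling the -3/-2 hidden layers out of a Hugging Face RoBERTa checkpoint; the hfl/chinese-roberta-wwm-ext-large model name and the concatenation along the feature dimension are assumptions:

```python
# Sketch: extract the -3/-2 BERT hidden layers; checkpoint name is an assumption.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "hfl/chinese-roberta-wwm-ext-large"  # assumed Chinese RoBERTa checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

@torch.no_grad()
def get_bert_feature(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    # output_hidden_states=True returns the embedding layer plus every transformer layer
    outputs = model(**inputs, output_hidden_states=True)
    # Take the third- and second-to-last layers and concatenate them on the feature
    # dimension, skipping the last layer, which is biased toward the MLM objective.
    feats = torch.cat(outputs.hidden_states[-3:-1], dim=-1)  # [1, T, 2*hidden]
    return feats[0]  # [T, 2*hidden], one vector per subword token

print(get_bert_feature("还有公平可言吗?").shape)
```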
Addendum: if your model cannot pronounce any words correctly, we suggest first training a base model without the BERT embedding, and adding the BERT embedding only after the model can pronounce words.
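As a hedged illustration of that suggestion (the toggle below is hypothetical, not an actual repo flag), the base-model stage can simply feed all-zero BERT features, so the TextEncoder's projection layer keeps its shape but contributes nothing until you switch it on:

```python
import torch

USE_BERT = False  # hypothetical switch; flip to True once pronunciation is stable

def bert_input(bert_feat: torch.Tensor) -> torch.Tensor:
    """bert_feat: phone-aligned BERT features, e.g. shape [bert_dim, T]."""
    if USE_BERT:
        return bert_feat
    # Zeros keep the checkpoint shape unchanged (the projection layer still exists)
    # while removing any BERT influence on the phoneme embedding in this stage.
    return torch.zeros_like(bert_feat)
```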
Thanks for the help. I am using a Mandarin dataset; we are training on a 14-speaker dataset with about 50K samples.
For a base model, that may be a little too little. I suggest you load my base model and fine-tune it on your data.
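A rough sketch of that warm start (the checkpoint path and the generator variable are placeholders; typical VITS-style checkpoints wrap the weights under a "model" key): copy over only the tensors whose names and shapes match, so mismatched layers such as a different speaker embedding are simply skipped.

```python
import torch
from torch import nn

def warm_start(model: nn.Module, ckpt_path: str) -> None:
    """Load released base-model weights into a freshly built generator,
    skipping tensors whose names or shapes do not match."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    saved = ckpt.get("model", ckpt)          # unwrap if stored under "model"
    own = model.state_dict()
    filtered = {k: v for k, v in saved.items() if k in own and own[k].shape == v.shape}
    skipped = [k for k in saved if k not in filtered]
    own.update(filtered)
    model.load_state_dict(own)
    print(f"loaded {len(filtered)} tensors, skipped {len(skipped)}: {skipped}")

# usage (assumed names): warm_start(net_g, "G_base.pth")
```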
No, it can pronounce most of the words, but in each sentence one or more characters are wrong, e.g.
GT: 比赛结果二比零。 GEN (sounds like): 比赛结果二点零。
GT: 还有公平可言吗? GEN (sounds like): 还有公平可怜吗?
GT: 这加重了企业负担。 GEN (sounds like): 这加大了企业负担。
If we train a base model without BERT first and then with BERT, how many steps should we train the first stage? And should we keep the duration predictor trained in that stage?
For your problem, I think the g2p or the dataset is to blame. I am not sure how to fix it, but you can fine-tune your data on my base model and try again.
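If you want to rule out the g2p, a quick hedged check is to inspect the pinyin output for a mispronounced sentence; the Chinese frontend in this family of repos is built around pypinyin, so something like the following should surface obvious conversion errors:

```python
# Sanity-check the grapheme-to-phoneme step for sentences that come out wrong.
from pypinyin import Style, lazy_pinyin

for sentence in ["还有公平可言吗?", "这加重了企业负担。"]:
    # TONE3 appends the tone number to each syllable, e.g. "yan2"
    print(sentence, "->", lazy_pinyin(sentence, style=Style.TONE3))
```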
Another question: I am using Bert-VITS2 v1.1, since I do not need the multi-language feature. Most of the model code in that version is the same as in https://github.com/p0p4k/vits2_pytorch. I have now noticed that this project has been upgraded to v2.3, which adds a WavLMDiscriminator as in StyleTTS2 and a CLAP-based emotion embedding. As far as I know, there are some training-stability problems in the vits2_pytorch project; have any changes been made to improve or solve those problems?
We re-implemented the duration discriminator since the original approach in p0p4k/vits2_pytorch is not practical per our experiments (we still don't know the exact structure of the DD in the VITS2 paper as no official implementation is available as of now). CLAP is removed in 2.3 since it's not functioning as we expected.
To train a single-language Bert-VITS2 v2.3 from scratch, you can modify the text encoder to remove the English and Japanese BERT feature projection layers, train without the DD and SLM discriminator for the initial steps to obtain a pre-trained baseline model, and then fine-tune the model with the DD and SLM-D.
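A rough sketch of that text-encoder change, assuming a v2.3-style layout where each language's BERT features get their own 1x1 conv projection; only the Chinese projection is kept, and the class and argument names here are illustrative rather than the repo's exact ones:

```python
import math
import torch
from torch import nn

class TextEncoderZhOnly(nn.Module):
    """Keeps only the Chinese BERT projection; the English/Japanese
    projections (and their inputs) of the multilingual encoder are dropped."""

    def __init__(self, n_vocab: int, hidden_channels: int, bert_dim: int = 1024):
        super().__init__()
        self.hidden_channels = hidden_channels
        self.emb = nn.Embedding(n_vocab, hidden_channels)
        self.bert_proj = nn.Conv1d(bert_dim, hidden_channels, 1)  # Chinese BERT only

    def forward(self, x: torch.Tensor, bert: torch.Tensor) -> torch.Tensor:
        # x: [B, T] phoneme ids; bert: [B, bert_dim, T] phone-aligned BERT features
        h = self.emb(x) * math.sqrt(self.hidden_channels)    # [B, T, H]
        h = h + self.bert_proj(bert).transpose(1, 2)          # add projected BERT features
        return h  # feed into the usual attention encoder / projection as before
```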
The training data should be no less than 200 hours per language for optimum performance. Hence we suggest fine-tuning our base model instead of training from scratch for most scenarios.
Thank you very much.
We modified our models.py according to v2.3 to add this code:
logw_sdp = self.sdp(x, x_mask, g=g, reverse=True, noise_scale=1.0)
l_length_sdp += torch.sum((logw_sdp - logw_) ** 2, [1, 2]) / torch.sum(x_mask)
but it seems to harm base-model training. Is that change verified?
This is used for the DD (duration discriminator).
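For context, a hedged reading of how that snippet fits in (variable names follow the code above; the exact returns in models.py may differ): the squared-error term regresses the sampled durations toward the ground truth, and logw_sdp is also handed to the duration discriminator during training.

```python
# logw_    : log ground-truth durations from the monotonic alignment search
# logw     : output of the deterministic DurationPredictor
# logw_sdp : StochasticDurationPredictor sampled in reverse (generation) mode
logw_sdp = self.sdp(x, x_mask, g=g, reverse=True, noise_scale=1.0)
l_length_sdp += torch.sum((logw_sdp - logw_) ** 2, [1, 2]) / torch.sum(x_mask)
# In v2.3 these duration tensors are also returned so the duration discriminator
# can learn to tell predicted durations from real ones; if you train without the
# DD enabled, the extra regression term above is the only effect of this change.
```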
When training the base model, what needs to be modified? It is quite strange: sometimes at around 50k steps you can already hear words, and at around 100k steps sentences take shape, but at around 120k steps training hangs without crashing; all GPUs stay at 100% but it makes no further progress. Other times training runs through, but at 50k or even 100k steps the output still sounds like pure noise with no content at all, even though the losses decrease normally: the mel loss drops to around 22, the KL divergence is 3-5, and the duration loss is around 2-4. Which part is going wrong here? Or is base-model training in this project inherently unstable?