Question about punctuation in the script and mixed-voice data.
Thank you @v-nhandt21 for sharing the repo. I have two questions; if you have time, please help me.

- Before the `script` is transformed into `phoneme` through the function `vi2IPA_split`, does it need punctuation marks removed? I ask because I see that the `vivos` dataset has no punctuation. Assuming we don't remove the punctuation, will it affect the output? For example, is the silence longer when there is a `,` mark? (A minimal sketch of what I mean is after this list.)
- I see that the `vivos` data mixes male and female voices. Assuming my dataset focuses on only one gender and one voice, will this make the final output better?
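To make the first question concrete, here is a minimal sketch of the transformation I mean. I am assuming the call is `vi2IPA_split(text, delimit)` from the `viphoneme` package; the exact API may differ.

```python
# Minimal sketch: phonemize the same sentence with and without punctuation
# to see whether the "," changes anything (assumed API: vi2IPA_split(text, delimit)).
from viphoneme import vi2IPA_split

delimit = "/"
with_punct = vi2IPA_split("xin chào, bạn khỏe không?", delimit)
without_punct = vi2IPA_split("xin chào bạn khỏe không", delimit)

print(with_punct)
print(without_punct)
# Question: does the "," lead to a longer predicted silence in the synthesized audio?
```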
Hi @drlor2k:

- In my experiments there are two main pause durations in speech synthesis, a short pause and a long pause, therefore I convert all punctuation to "," or "." (a sketch of this mapping is below).
- VIVOS is only an example script for training; to train a voice-cloning model from scratch, I think you should collect more than 200 hours of clean audio.

You can check out this repo: https://github.com/thinhlpg/vixtts-demo , they provide an available pretrained model for fine-tuning.
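A minimal sketch of that punctuation mapping (my own illustration of the idea above, not code from the repo): every punctuation mark is collapsed to "," for a short pause or "." for a long pause before phonemization. The exact character sets are assumptions.

```python
import re

# Collapse punctuation into two pause symbols before phonemization:
# "," for a short pause, "." for a long pause.
LONG_PAUSE = ".!?…"
SHORT_PAUSE = ",;:()\"-"

def normalize_punctuation(text: str) -> str:
    text = re.sub(f"[{re.escape(LONG_PAUSE)}]+", ".", text)
    text = re.sub(f"[{re.escape(SHORT_PAUSE)}]+", ",", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_punctuation("Xin chào!! Bạn khỏe không; tôi vẫn ổn..."))
# -> "Xin chào. Bạn khỏe không, tôi vẫn ổn."
```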
Thanks for your response @v-nhandt21, I tried https://github.com/thinhlpg/vixtts-demo, it's a great attempt but it lacks the necessary stability. I actually forgot that I could fine-tune it :v
Hello @v-nhandt21, I have some takeaways from VITS2 and XTTS, can you give your opinion?

1. In terms of sound output quality, VITS may be better than XTTS.
2. XTTS is based on `text-to-token` via a `tokenizer`, so it covers almost all words, including words outside the training language, which gives it the ability to pronounce some common foreign words, as long as these words appear in the training data. In contrast, VITS depends on `text-to-phoneme`, so foreign words almost always have no corresponding phoneme.
3. Based on point 2, intuitively we should eliminate audio with foreign pronunciation, because:
   - the audio contains the sounds of the foreign words,
   - while the corresponding phonemes are missing, since the foreign word has been converted to `/`,
   - which leads to inconsistencies between audio and text.
4. How do we deal with the out-of-phoneme words in VITS?
   - A quick way is to convert the word into parts that VITS can pronounce, for example `hello` to `hé lô` (see the sketch after this message). However, this is a superficial fix because it does not solve the root of the problem.
   - If you have experience, could you suggest a solution from the training perspective, that is, adjustments in the `viphoneme` package?

Thank you if you take the time to respond!
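For point 4, a minimal sketch of the "quick way" (my own illustration; the respellings and the word-matching rule are assumptions): rewrite known foreign words into Vietnamese-readable syllables before the text reaches `vi2IPA_split`.

```python
import re

# Hypothetical respelling dictionary: foreign word -> Vietnamese-readable syllables.
FOREIGN_RESPELL = {
    "hello": "hé lô",
    "internet": "in tơ nét",
    "email": "i meo",
}

def respell_foreign(text: str) -> str:
    # Replace only words found in the dictionary; everything else is untouched.
    def repl(match: re.Match) -> str:
        word = match.group(0)
        return FOREIGN_RESPELL.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", repl, text)

print(respell_foreign("Tôi gửi email qua internet"))
# -> "Tôi gửi i meo qua in tơ nét"
```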
Hi @drlor2k ,

- VITS is a model for one language and XTTS is multilingual, but I think a multilingual model cannot cover all normal phonemics in a practical product. Therefore, we still need a traditional method to control out-of-vocabulary cases.
- Besides the dictionary-checking method you mentioned in (4), you can try forced alignment, which is a model that predicts phonemes. To train it, we only need a dictionary; at inference time, the model can predict phonemes for any out-of-vocabulary word: https://github.com/v-nhandt21/ViMFA
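As a rough sketch of the dictionary-checking side of this (the lexicon and its format are assumptions for illustration): filter out training utterances whose words are not fully covered, and route the flagged ones to respelling or a forced-alignment tool such as ViMFA.

```python
# Sketch: keep only utterances whose words are all covered by a phoneme
# lexicon; the toy lexicon below stands in for a real dictionary file.
def is_in_vocab(sentence: str, lexicon: set[str]) -> bool:
    words = (w.lower().strip(".,!?") for w in sentence.split())
    return all(w in lexicon for w in words)

lexicon = {"tôi", "đi", "học", "gửi", "qua"}   # toy lexicon, not a real dictionary
utterances = ["tôi đi học", "tôi gửi email qua internet"]

covered = [u for u in utterances if is_in_vocab(u, lexicon)]
flagged = [u for u in utterances if not is_in_vocab(u, lexicon)]

print(covered)   # ['tôi đi học']
print(flagged)   # ['tôi gửi email qua internet'] -> candidates for respelling or forced alignment
```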
Thank you for your response @v-nhandt21, I have another question, can you help me?

I see that some speech2speech repos use a very small val dataset (2 records per voice); basically, I understand they want to overfit as much as possible on the voice that needs to be cloned.

With your repo, does the val dataset affect the training process?

- If yes: suppose my training data is 200h, what is the appropriate size of the val dataset?
- If no: maybe I should keep the val dataset quite small to make the training process faster, what do you think? (A small sketch of what I mean is after this message.)
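Concretely, by "quite small" I mean something like this sketch. It assumes the training list is a text file with one `path|speaker|text` line per utterance; the real filelist format in the repo may differ.

```python
import random
from collections import defaultdict

def tiny_val_split(lines: list[str], per_speaker: int = 2, seed: int = 0):
    """Carve out a tiny validation set (a few records per speaker) from a
    filelist of 'path|speaker|text' lines; everything else stays in train."""
    random.seed(seed)
    by_speaker = defaultdict(list)
    for line in lines:
        _path, speaker, _text = line.rstrip("\n").split("|", 2)
        by_speaker[speaker].append(line)
    train, val = [], []
    for items in by_speaker.values():
        random.shuffle(items)
        val.extend(items[:per_speaker])
        train.extend(items[per_speaker:])
    return train, val

# toy usage with two speakers and three utterances each
lines = [f"wavs/spk{s}_{i}.wav|spk{s}|câu số {i}" for s in (1, 2) for i in range(3)]
train, val = tiny_val_split(lines)
print(len(train), len(val))  # 2 4
```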
My repo is a variant/adaptation of VITS for Vietnamese, therefore it is text-to-speech and the validation set has no effect!

For some repos like https://github.com/svc-develop-team/so-vits-svc?tab=readme-ov-file, it is speech-to-speech, because they try to learn speech characteristics from a small set and then inject these features to control the style of the output voice.