StyleTTS2
Fine-tuning or training from scratch in a different language?
Hi everyone, I'm considering putting some effort into training StyleTTS2 in Portuguese. I have a good-quality dataset for this task; however, I'm unsure whether it would be better to fine-tune the existing model (which I know was trained on English), or, since Portuguese is an unseen language, to train the model from scratch.
Does anyone have some tips on what I should consider before making a decision?
Definitely train a new PL-BERT for a new language. You can try the one trained on English, but even the author says it probably won't work.
Hi there -- I have trained a PL-BERT model on a 14 language dataset which was crowdsourced by the author of the paper. You can find this model open-sourced here: https://huggingface.co/papercup-ai/multilingual-pl-bert
Using this PL-BERT model, you can now train multilingual StyleTTS2 models. In my experiments, I have found that you don't need to train from scratch in order to train multilingual StyleTTS2, you can just finetune. Follow the steps outlined in the link I shared above!
Best of luck, and let me know what you make with this!
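In config terms, switching to the multilingual PL-BERT is roughly the change below. The `PLBERT_dir` key is what the repo's configs use at the time of writing, but double-check your own config file; the path is just wherever you unpacked the Hugging Face download:

```yaml
# Excerpt from a fine-tuning config such as Configs/config_ft.yml.
# Point PLBERT_dir at the folder holding the files downloaded from
# https://huggingface.co/papercup-ai/multilingual-pl-bert
PLBERT_dir: Utils/PLBERT/
```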
Thank you very much for this @rlenain . I'll use this model to train StyleTTS on my data
Nice work! Did the Chinese data used to train the model include tones?
I'm not sure -- you can see a sample here (the data is from this dataset: https://huggingface.co/datasets/styletts2-community/multilingual-phonemes-10k-alpha/viewer/zh).
Thank you very much @rlenain! This is a great addition! You mentioned you can just finetune on a new language instead of training a new base model, I'd like to try it. How large are the datasets you used for the finetuning on a new language?
I tend to keep some English in the dataset (~5 hours) and have had success with as little as 20 hours of Spanish data split across 4 speakers.
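As a sketch of what such a mixed-language dataset looks like on disk: the repo's Data/*_list.txt files are pipe-separated `wav_path|phonemized_text|speaker_id` lines, so mixing languages is just interleaving entries. The paths and phoneme strings below are made-up placeholders:

```python
import random

# Placeholder entries in the "wav_path|phonemized_text|speaker_id" format
# used by Data/train_list.txt; substitute your own phonemized data.
english = [("en/ljs_0001.wav", "ðə kˈæt sˈæt", "0")]            # ~5 h of English
spanish = [("es/spk1_0001.wav", "ˈola mˈundo", "1"),
           ("es/spk2_0001.wav", "bwˈenos dˈias", "2")]          # ~20 h across speakers

def make_train_list(*language_subsets, seed=0):
    """Interleave per-language entry lists into one shuffled train list."""
    rows = [entry for subset in language_subsets for entry in subset]
    random.Random(seed).shuffle(rows)  # mix the languages together
    return "\n".join("|".join(entry) for entry in rows)

print(make_train_list(english, spanish))
```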
Where can I see the list of these 14 languages?
https://huggingface.co/papercup-ai/multilingual-pl-bert
Thanks
@rlenain > i tend to keep some english in the dataset (~5 hours) and have had success with as little as 20 hours of Spanish data split across 4 speakers
Thanks for the great work! Do you have some samples to share? I'm very curious about the quality in a new language.
Unfortunately because of the privacy policy of the samples that I trained on, I cannot share these samples here. What I can say is that the quality is very much on-par with samples you can find on the samples page in English.
I would like to ask three questions: (1) Do the speaker labels in the dataset need to be numeric (e.g. speaker 0, 1, 2), and must they all be distinct, or can I give them all the same name, or even use a string such as a real name to make the speakers easier to recognize? (2) After training, do I need to specify the speaker at inference time to access a given voice? (3) Is the language selection automatic?
@rlenain
i tend to keep some english in the dataset (~5 hours) and have had success with as little as 20 hours of Spanish data split across 4 speakers
@rlenain Do you mind sharing for how many epochs you fine-tuned?
@sch0ngut Generally for 50k-100k iterations, whatever that works out to in epochs for the size of your dataset. But you should be following the validation curve.
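As a back-of-the-envelope conversion between iterations and epochs (the dataset and batch-size numbers below are hypothetical; substitute your own):

```python
import math

def iterations_to_epochs(iterations, num_utterances, batch_size):
    """Rough conversion: one epoch = one full pass over the dataset."""
    steps_per_epoch = math.ceil(num_utterances / batch_size)
    return iterations / steps_per_epoch

# e.g. ~25 h of audio at ~9 s per clip is ~10,000 utterances; batch size 8
print(round(iterations_to_epochs(50_000, 10_000, 8)))   # prints 40
print(round(iterations_to_epochs(100_000, 10_000, 8)))  # prints 80
```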
@rlenain What would I need to do to train it on Hindi?
You can probably just finetune StyleTTS2 without changing the PL-BERT model, and it would work, with the right data and amount of data. If you want to train PL-BERT on Hindi, I believe there's data here: https://huggingface.co/datasets/styletts2-community/multilingual-pl-bert
@rlenain Regarding this multilingual PL-BERT: it appears the data used to train it was produced with a data-processing script that isn't available to the general public. How would we tokenize the training data for StyleTTS2 in the same form as the BERT model?
The data here (https://huggingface.co/datasets/styletts2-community/multilingual-pl-bert) has been tokenized using the tokenizer of the bert-base-multilingual-cased model: https://huggingface.co/google-bert/bert-base-multilingual-cased
Hello @rlenain,
I've successfully trained StyleTTS2 with the multilingual PL-BERT from this source during the first stage using the LJSpeech dataset provided in this repository.
However, I encountered an issue at the start of the second stage where NaN values appeared. Could you help me identify any potential mistakes?
Here's what I've done so far:
- Converted the source WAV files to a 24k WAV format.
- Replaced the files in Utils/PLBERT/ with the multilingual PL-BERT.
- Conducted training on eight 3090 cards for 12 hours without any other modifications.
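On the 24 kHz conversion step: in practice this is usually a one-liner with ffmpeg (`ffmpeg -i in.wav -ar 24000 out.wav`) or librosa, but for completeness here is a dependency-free sketch using only the Python standard library, assuming mono 16-bit PCM input:

```python
import struct
import wave

def resample_wav_24k(src_path: str, dst_path: str, target_rate: int = 24000) -> None:
    """Resample a mono 16-bit PCM WAV to 24 kHz by linear interpolation."""
    with wave.open(src_path, "rb") as src:
        assert src.getnchannels() == 1 and src.getsampwidth() == 2
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)

    n_out = int(len(samples) * target_rate / rate)
    out = []
    for i in range(n_out):
        pos = i * rate / target_rate          # fractional index into the source
        j = min(int(pos), len(samples) - 2)
        frac = pos - j
        out.append(int(samples[j] * (1 - frac) + samples[j + 1] * frac))

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(target_rate)
        dst.writeframes(struct.pack(f"<{n_out}h", *out))
```

Note this is a quality compromise versus a proper polyphase resampler; for training data you would normally let ffmpeg or librosa handle anti-aliasing.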
[first-stage loss graph]
Addendum:
- With debugging, I found the first NaN comes from https://github.com/yl4579/StyleTTS2/blob/5cedc71c333f8d8b8551ca59378bdcc7af4c9529/train_second.py#L400
Solved it: it was just a bad config that caused the first-stage parameters to be loaded into the second-stage model.
I should have set first_stage_path instead of pretrained_model.
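For anyone else hitting the same NaN, the relevant part of the second-stage config ends up looking roughly like this (the checkpoint path is a placeholder; verify the key names against your own config file):

```yaml
# Second-stage config excerpt: load the stage-one checkpoint via
# first_stage_path; pretrained_model should stay empty unless you are
# resuming an actual second-stage run.
first_stage_path: "Models/LJSpeech/first_stage.pth"
pretrained_model: ""
```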
Can you please elaborate on which files you replaced in the PLBERT folder?
Just copy the multilingual PL-BERT files into the old folder; it works perfectly.
@rlenain Did you use OpenSLR's recordings for Spanish, or something else?
@chocolatedesue would you mind closing the issue if it's solved now?
I think you might have mistaken me for someone else.
This issue was not opened by me. It might be better to reach out directly to the original issuer, @paulovasconcellos-hotmart.
Sorry, I probably clicked the wrong name in autocomplete and didn't realize it. Thanks for pinging the original author :)