StyleTTS2
Fine-tuning or training from scratch in a different language?
Hi everyone, I'm considering putting some effort into training StyleTTS2 in Portuguese. I have a good-quality dataset for this task; however, I'm unsure whether it would be better to fine-tune the existing model (which I know was trained on English), or, since Portuguese is an unseen language, to train the model from scratch.
Does anyone have some tips on what I should consider before making a decision?
Definitely train a new PL-BERT for a new language. You can try the one trained on English, but even the author says it probably won't work.
Hi there -- I have trained a PL-BERT model on a 14 language dataset which was crowdsourced by the author of the paper. You can find this model open-sourced here: https://huggingface.co/papercup-ai/multilingual-pl-bert
Using this PL-BERT model, you can now train multilingual StyleTTS2 models. In my experiments, I have found that you don't need to train from scratch in order to train multilingual StyleTTS2, you can just finetune. Follow the steps outlined in the link I shared above!
Best of luck, and let me know what you make with this!
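In config terms, switching to the multilingual PL-BERT is roughly the change below. The `PLBERT_dir` key is what the repo's configs use at the time of writing, but double-check your own config file; the path is just wherever you unpacked the Hugging Face download:

```yaml
# Excerpt from a fine-tuning config such as Configs/config_ft.yml.
# Point PLBERT_dir at the folder holding the files downloaded from
# https://huggingface.co/papercup-ai/multilingual-pl-bert
PLBERT_dir: Utils/PLBERT/
```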
Thank you very much for this @rlenain . I'll use this model to train StyleTTS on my data
Nice work! Did the Chinese data used to train the model include tones?
I'm not sure -- you can see a sample here (the data is from this dataset: https://huggingface.co/datasets/styletts2-community/multilingual-phonemes-10k-alpha/viewer/zh).
Thank you very much @rlenain! This is a great addition! You mentioned you can just finetune on a new language instead of training a new base model, I'd like to try it. How large are the datasets you used for the finetuning on a new language?
I tend to keep some English in the dataset (~5 hours) and have had success with as little as 20 hours of Spanish data split across 4 speakers.
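As a sketch of what such a mixed-language dataset looks like on disk: the repo's Data/*_list.txt files are pipe-separated `wav_path|phonemized_text|speaker_id` lines, so mixing languages is just interleaving entries. The paths and phoneme strings below are made-up placeholders:

```python
import random

# Placeholder entries in the "wav_path|phonemized_text|speaker_id" format
# used by Data/train_list.txt; substitute your own phonemized data.
english = [("en/ljs_0001.wav", "ðə kˈæt sˈæt", "0")]            # ~5 h of English
spanish = [("es/spk1_0001.wav", "ˈola mˈundo", "1"),
           ("es/spk2_0001.wav", "bwˈenos dˈias", "2")]          # ~20 h across speakers

def make_train_list(*language_subsets, seed=0):
    """Interleave per-language entry lists into one shuffled train list."""
    rows = [entry for subset in language_subsets for entry in subset]
    random.Random(seed).shuffle(rows)  # mix the languages together
    return "\n".join("|".join(entry) for entry in rows)

print(make_train_list(english, spanish))
```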
Where can I see the list of these 14 languages?
https://huggingface.co/papercup-ai/multilingual-pl-bert
Thanks
@rlenain > i tend to keep some english in the dataset (~5 hours) and have had success with as little as 20 hours of Spanish data split across 4 speakers
Thanks for the great work! Do you have some samples to share? I'm very curious about the quality in a new language.
Unfortunately because of the privacy policy of the samples that I trained on, I cannot share these samples here. What I can say is that the quality is very much on-par with samples you can find on the samples page in English.
I would like to ask three questions: (1) Do the speaker labels in the dataset need to be numeric (e.g. speaker 0, 1, 2), and must they all be distinct, or can I give them all the same name, or even use a string such as a real name to make the speakers easier to recognize? (2) After training, do I need to specify the speaker at inference time to access a given voice? (3) Is the language selection automatic?
@rlenain
i tend to keep some english in the dataset (~5 hours) and have had success with as little as 20 hours of Spanish data split across 4 speakers
@rlenain Do you mind sharing for how many epochs you fine-tuned?
@sch0ngut Generally for 50k-100k iterations, whatever that works out to in epochs for the size of your dataset. But you should be following the validation curve.
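As a back-of-the-envelope conversion between iterations and epochs (the dataset and batch-size numbers below are hypothetical; substitute your own):

```python
import math

def iterations_to_epochs(iterations, num_utterances, batch_size):
    """Rough conversion: one epoch = one full pass over the dataset."""
    steps_per_epoch = math.ceil(num_utterances / batch_size)
    return iterations / steps_per_epoch

# e.g. ~25 h of audio at ~9 s per clip is ~10,000 utterances; batch size 8
print(round(iterations_to_epochs(50_000, 10_000, 8)))   # prints 40
print(round(iterations_to_epochs(100_000, 10_000, 8)))  # prints 80
```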
@rlenain What would I need to do to train it on Hindi?
You can probably just finetune StyleTTS2 without changing the PL-BERT model, and it would work, with the right data and amount of data. If you want to train PL-BERT on Hindi, I believe there's data here: https://huggingface.co/datasets/styletts2-community/multilingual-pl-bert
@rlenain Regarding this multilingual PL-BERT: it appears the data used to train it was produced with a data-processing script that isn't available to the general public. How would we tokenize the training data for StyleTTS2 in the same form as the BERT model?
The data here (https://huggingface.co/datasets/styletts2-community/multilingual-pl-bert) has been tokenized using the tokenizer of the bert-base-multilingual-cased model: https://huggingface.co/google-bert/bert-base-multilingual-cased
Hello @rlenain,
I've successfully trained StyleTTS2 with the multilingual PL-BERT from this source during the first stage using the LJSpeech dataset provided in this repository.
However, I encountered an issue at the start of the second stage where NaN values appeared. Could you help me identify any potential mistakes?
Here's what I've done so far:
- Converted the source WAV files to a 24k WAV format.
- Replaced the files in Utils/PLBERT/ with the multilingual PL-BERT.
- Conducted training on eight 3090 cards for 12 hours without any other modifications.
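On the 24 kHz conversion step: in practice this is usually a one-liner with ffmpeg (`ffmpeg -i in.wav -ar 24000 out.wav`) or librosa, but for completeness here is a dependency-free sketch using only the Python standard library, assuming mono 16-bit PCM input:

```python
import struct
import wave

def resample_wav_24k(src_path: str, dst_path: str, target_rate: int = 24000) -> None:
    """Resample a mono 16-bit PCM WAV to 24 kHz by linear interpolation."""
    with wave.open(src_path, "rb") as src:
        assert src.getnchannels() == 1 and src.getsampwidth() == 2
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)

    n_out = int(len(samples) * target_rate / rate)
    out = []
    for i in range(n_out):
        pos = i * rate / target_rate          # fractional index into the source
        j = min(int(pos), len(samples) - 2)
        frac = pos - j
        out.append(int(samples[j] * (1 - frac) + samples[j + 1] * frac))

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(target_rate)
        dst.writeframes(struct.pack(f"<{n_out}h", *out))
```

Note this is a quality compromise versus a proper polyphase resampler; for training data you would normally let ffmpeg or librosa handle anti-aliasing.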
[first-stage loss graph]
Addendum:
- With debugging, I found the first NaN comes from https://github.com/yl4579/StyleTTS2/blob/5cedc71c333f8d8b8551ca59378bdcc7af4c9529/train_second.py#L400
Solved it: it was just a bad config that caused the first-stage parameters to be loaded into the second-stage model.
I should have set first_stage_path instead of pretrained_model.
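For anyone else hitting the same NaN, the relevant part of the second-stage config ends up looking roughly like this (the checkpoint path is a placeholder; verify the key names against your own config file):

```yaml
# Second-stage config excerpt: load the stage-one checkpoint via
# first_stage_path; pretrained_model should stay empty unless you are
# resuming an actual second-stage run.
first_stage_path: "Models/LJSpeech/first_stage.pth"
pretrained_model: ""
```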
Can you please elaborate on which files you replaced in the PLBERT folder?
Just copy the multilingual PL-BERT files into the old folder; it works perfectly.
@rlenain Did you use OpenSLR's recordings for Spanish, or something else?
@chocolatedesue would you mind closing the issue if it's solved now?
I think you might have mistaken me for someone else.
This issue was not opened by me. It might be better to reach out directly to the original issuer, @paulovasconcellos-hotmart.
Sorry, I probably clicked the wrong name in autocomplete and didn't realize it. Thanks for pinging the original author :)