Awesome in english but no support for other languages - please add an example for another language (german, italian, french etc)

Open cmp-nct opened this issue 1 year ago • 85 comments

The readme makes it sound very simple: "Replace bert with xphonebert". Looking a bit closer, it seems to be quite a feat to make StyleTTS2 talk in non-English languages (https://github.com/yl4579/StyleTTS2/issues/28).

StyleTTS2 looks like the best approach we have right now, but English-only support is a dealbreaker for many: any app built on it will be limited to English, with no prospect in sight for other users.

Some help to get this going in foreign languages would be awesome.

It appears we need to change the inference code and re-train the text and phonetics components. Any demo or guide would be great.

Alternatively, the current PL-BERT could be re-trained for other languages, though that needs a corpus and I have no idea what it would cost (https://github.com/yl4579/PL-BERT).

cmp-nct avatar Nov 20 '23 00:11 cmp-nct

The repo so far is a research project, and its main purpose is to serve as a proof of concept for the paper rather than as a full-fledged open source project. I agree that PL-BERT is the major obstacle to generalizing to other languages, but training large-scale language models, particularly on multiple languages, can be very challenging. With the resources I have at school, training PL-BERT on an English-only corpus with 3 A40s took me a month; with all the ablation studies and experiments, I spent an entire summer on this project for a single language.

I'm not affiliated with any company and I'm only a PhD student, and the GPU resources in our lab need to be prioritized for new research projects. I don't think I will have the resources to train a multilingual PL-BERT model for the time being, so PL-BERT is probably not the best approach to multilingual models for StyleTTS 2.

I have never tried XPhoneBERT myself, but it seems to be a promising alternative to PL-BERT. The only problem is that it uses a different phonemizer, which may also be related to #40. The current phonemizer was taken from VITS, which also raises license issues (MIT vs. GPL). It would be great if someone could help switch the phonemizer and BERT model to something like XPhoneBERT that is compatible with the MIT license and also supports multiple languages.

The basic idea is to re-train the ASR model (https://github.com/yl4579/AuxiliaryASR) using the phonemizer of XPhoneBERT, replace PL-BERT with XPhoneBERT, and re-train the model from scratch. Since the models, especially the LibriTTS model, took about 2 weeks to train on 4 A100s, I do not think I have enough GPU resources to work on this for the time being. If anyone is willing to sponsor GPUs and datasets for either a multilingual PL-BERT or an XPhoneBERT StyleTTS 2, I'm happy to extend this project in the multilingual direction.

yl4579 avatar Nov 20 '23 01:11 yl4579

I think it would be doable to get the GPU time: 1 week of 8xA100, maybe in exchange for naming the resulting model after the sponsor. One of the cloud providers might be interested, or some people from the ML Discords who train a lot might have it to spare. I was offered GPU time once and could ask the guy. But without datasets that wouldn't help. That said, if you need GPU time, let me know and I'll ask.

Datasets:

German: TTS dataset from a university (high quality, 6 main speakers, I think 40-50 hours of studio-quality recordings): https://opendata.iisys.de/dataset/hui-audio-corpus-german/ (https://github.com/iisys-hof/HUI-Audio-Corpus-German)
German: https://github.com/thorstenMueller/Thorsten-Voice (11 hours, one person)

Italian: TTS datasets, LJSpeech affiliated?
https://huggingface.co/datasets/z-uo/female-LJSpeech-italian
https://huggingface.co/datasets/z-uo/male-LJSpeech-italian

Multilingual:
https://www.openslr.org/94/ (audiobook based libritts)
https://github.com/freds0/CML-TTS-Dataset (more than 3000 hours, CC licensed)

Sidenote: For detecting unclean audio, "CLAP" from LAION could possibly be used.

cmp-nct avatar Nov 20 '23 02:11 cmp-nct

Multilingual speech datasets are more difficult to get than language datasets. XPhoneBERT, for example, was trained entirely on Wikipedia in 100+ languages, but getting 100+ languages of speech data with transcriptions is more difficult. XTTS has multilingual support, but the data used seems to be private. I believe the creator @erogol was once interested in StyleTTS but did not proceed to integrate it into the Coqui API for some reason. It would be great if he could help with multilingual support. I will ping him to see if he is still interested.

yl4579 avatar Nov 20 '23 02:11 yl4579

I found quite good datasets for Italian and German and will take another look for more. I will update the previous comment. Roughly how much data (length, number of speakers) is needed for training?

cmp-nct avatar Nov 20 '23 03:11 cmp-nct

If you want cross-lingual generalization, I think each language should have at least 100 hours of data. The data you provided is probably good for a single-speaker model, but not enough for zero-shot models like XTTS. It is not feasible to get a model like that with publicly available data. We probably have to rely on something like Multilingual LibriSpeech (https://www.openslr.org/94/) and use some speech restoration models to remove bad samples. This is not a single person's effort, so everyone else is welcome to contribute.

yl4579 avatar Nov 20 '23 05:11 yl4579

It's a pity that Chinese is not supported.

mzdk100 avatar Nov 21 '23 09:11 mzdk100

I can make a 8x 3090 (24GB) machine available, if it's of use. 2x Xeon E5-2698 v3 cpus, 128GB ram. Alternatively: a 4x 3090 box with nvlinks, Epyc 7443p, 256GB, pcie 4.0. Send a mail to [email protected]

hobodrifterdavid avatar Nov 21 '23 13:11 hobodrifterdavid

I can help with training a Turkish model; I just need some help training PL-BERT on the Turkish Wikipedia dataset.

tosunozgun avatar Nov 21 '23 15:11 tosunozgun

@hobodrifterdavid Thanks so much for your help. What you have now is probably good for multilingual PL-BERT training as long as you can keep this machine running for at least a couple of months or so. Just sent you an email for multilingual PL-BERT training.

yl4579 avatar Nov 21 '23 20:11 yl4579

I think the GPUs provided by @hobodrifterdavid would be a great start for multilingual PL-BERT training. Before proceeding though, I need some people who speak as many languages as possible (hopefully also have some knowledge in IPA) to help with the data preparation. I only speak English, Chinese and Japanese, so I can only help with these 3 languages.

My plan is to use this multilingual BERT tokenizer: https://huggingface.co/bert-base-multilingual-cased. We tokenize the text, get the corresponding tokens, use phonemizer to get the corresponding phonemes, and align the phonemes with the tokens. Since this tokenizer is subword-based, we cannot predict the subword grapheme tokens: a subword token is not a full word's graphemes anyway, and we cannot really align part of a word with some of its phonemes. For example, in English "phonemes" can be tokenized into phone#, #me#, #s, but its actual phonemes are /ˈfəʊniːmz/, which cannot be aligned perfectly with phone#, #me#, or #s. So my idea is that instead of predicting the grapheme tokens, we predict the contextualized embeddings from a pre-trained BERT model.
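The grouping step in this plan (collecting subword tokens back into whole words so that each word can be paired with its phonemes) can be sketched roughly as follows. This is a minimal illustration, not the PL-BERT preprocessing code. It assumes WordPiece-style tokens in which a continuation piece carries a `##` prefix, and the token list and phoneme strings are typed in by hand instead of coming from a real tokenizer or phonemizer. `group_subwords` and `align_words_to_phonemes` are hypothetical helper names.

```python
def group_subwords(tokens):
    """Group token indices into words; a '##' prefix marks a continuation piece."""
    words = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and words:
            words[-1].append(index)   # extend the current word
        else:
            words.append([index])     # start a new word
    return words


def align_words_to_phonemes(tokens, word_phonemes):
    """Pair each word (a group of subword tokens) with its phoneme string."""
    groups = group_subwords(tokens)
    if len(groups) != len(word_phonemes):
        raise ValueError("number of words and number of phoneme strings differ")
    return [
        {"token_ids": group,
         "tokens": [tokens[i] for i in group],
         "phonemes": phonemes}
        for group, phonemes in zip(groups, word_phonemes)
    ]


# Hand-written example data (assumption: not real tokenizer/phonemizer output).
tokens = ["this", "is", "a", "test", "sen", "##tence"]
word_phonemes = ["ðɪs", "ɪz", "ə", "tɛst", "ˈsɛnʔn̩ts"]

alignment = align_words_to_phonemes(tokens, word_phonemes)
for entry in alignment:
    print(entry["tokens"], "->", entry["phonemes"])
```

Alignment happens at the word level here, which is why the plan never needs to decide which phonemes belong to which subword piece.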

For example, for the sentence "This is a test sentence", we get 6 tokens [this, is, a, test, sen#, #tence] and their corresponding graphemes. In particular, the two tokens [sen#, #tence] correspond to ˈsɛnʔn̩ts. The goal is to map each of the grapheme representations in ˈsɛnʔn̩ts to the average contextualized BERT embedding of [sen#, #tence]. This requires running the teacher BERT model, but we can extract the contextualized BERT embeddings online (during training) and maximize the cosine similarity between the predicted embeddings of these words and those of the teacher model (multilingual BERT).
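The training target described here could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual training code: embeddings are plain Python lists instead of tensors, the teacher vectors are made-up 2-dimensional numbers, and the loss is written as `1 - cosine similarity` (one common way to turn "maximize the cosine similarity" into something to minimize). All function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def word_level_loss(predicted, teacher_embeddings, token_ids_per_position):
    """Average (1 - cosine similarity) over phoneme positions.

    predicted:              one predicted vector per phoneme position
    teacher_embeddings:     one teacher vector per subword token
    token_ids_per_position: for each phoneme position, the indices of the
                            subword tokens that make up its word
    """
    losses = []
    for pred, token_ids in zip(predicted, token_ids_per_position):
        target = mean_vector([teacher_embeddings[i] for i in token_ids])
        losses.append(1.0 - cosine_similarity(pred, target))
    return sum(losses) / len(losses)


# Toy numbers (assumption): teacher vectors for the tokens "sen" and "##tence".
teacher = [[1.0, 0.0], [0.0, 1.0]]
# Two phoneme positions of the same word, both pointing at tokens 0 and 1.
positions = [[0, 1], [0, 1]]
predicted = [[1.0, 1.0], [1.0, 0.0]]

loss = word_level_loss(predicted, teacher, positions)
print(round(loss, 4))
```

Every phoneme position inside a word shares the same target, the mean of that word's subword embeddings, so no phoneme-to-subword alignment is required.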

Now the biggest challenge is aligning the tokenizer output to the graphemes, which may require some expertise in the specific languages. There could be quirks, inaccuracies, or traps for certain languages. For example, phonemizer doesn't work with Japanese and Chinese directly; you first have to convert the graphemes into an alphabetic form and then use phonemizer. The characters in these languages can have different pronunciations depending on the context, so expertise in these languages is needed when doing NLP with them. To make sure the data preprocessing goes as smoothly and accurately as possible, any help from those who speak any language in this list (or know some linguistics about these languages) is greatly appreciated.

yl4579 avatar Nov 21 '23 21:11 yl4579

I can speak Persian, Japanese and a little bit of Arabic (I have a friend who is fluent in it as well). I would very much like to help you with this. I'm also gathering labeled speech data for these languages right now (I have a little less than 100 hours for Persian and a bit for the other two). So count me in, please.

SoshyHayami avatar Nov 21 '23 21:11 SoshyHayami

@SoshyHayami Thanks for your willingness to help.

Fortunately, I think most other languages that have whitespace between words can be handled with the same logic. The only supported languages that do not put spaces between words are Chinese, Japanese, Korean (rarely, when Hanja is used), and Burmese. These languages probably need to be handled with their own logic. I can handle the first two, and we just need someone to handle the other two (Korean Hanja and Burmese).
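One rough way to route text between the shared whitespace logic and the language-specific logic is sketched below. This is only a heuristic under stated assumptions: it looks at Unicode character names from the standard library to spot Chinese characters (which also covers Hanja), Japanese kana, and Burmese script, and it leaves the language-specific segmenters unimplemented. `needs_custom_segmentation` and `split_words` are hypothetical names.

```python
import unicodedata

# Unicode name prefixes for scripts that are written without spaces between words.
UNSPACED_SCRIPT_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH",  # Chinese characters, Japanese kanji, Korean Hanja
    "HIRAGANA",
    "KATAKANA",
    "MYANMAR",                # Burmese
)


def needs_custom_segmentation(text):
    """Return True if any character belongs to one of the unspaced scripts."""
    return any(
        unicodedata.name(ch, "").startswith(UNSPACED_SCRIPT_PREFIXES)
        for ch in text
    )


def split_words(text):
    """Split on whitespace, or refuse if the text needs its own segmenter."""
    if needs_custom_segmentation(text):
        raise NotImplementedError("this script needs a language-specific segmenter")
    return text.split()


print(split_words("Dies ist ein Testsatz"))
print(needs_custom_segmentation("你好,我是中国人"))
```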

yl4579 avatar Nov 21 '23 21:11 yl4579

It would be great if it could support Chinese! I am a native Chinese speaker. What help can I provide?

mzdk100 avatar Nov 21 '23 22:11 mzdk100

Maybe I’ll create a new branch in the PL-BERT repo for multilingual processing scripts. Chinese and Japanese definitely need to be processed separately with their own logic. @mzdk100 If you have a good Chinese phonemizer (Chinese characters to pinyin), you are welcome to contribute.

yl4579 avatar Nov 21 '23 23:11 yl4579

In the case of Japanese, since it already has kana, which is basically an alphabet, can't we simply restrict it to just that for now? (Kana and romaji should be easier to phonemize, if I'm not mistaken.) Sorry, it might be a stupid idea, but I was thinking that if we had another language model that recognized the correct pronunciations based on the context and then converted the text (with the converted text handed over to the phonemizer), it might make things a bit easier here.

Though it'll probably also make inference painfully slow on low-performance devices.
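The "restrict Japanese to kana for now" idea could start from something like the sketch below. It only checks whether a string is already pure kana and folds katakana into hiragana using the fixed code point offset between the two Unicode blocks. It does not convert kanji to kana, which is the context-dependent part mentioned above. The function names are hypothetical.

```python
HIRAGANA = range(0x3041, 0x3097)   # ぁ .. ゖ
KATAKANA = range(0x30A1, 0x30F7)   # ァ .. ヶ
PROLONGED_SOUND_MARK = "\u30fc"    # ー
KATAKANA_TO_HIRAGANA_OFFSET = 0x60


def katakana_to_hiragana(text):
    """Map each katakana letter to the hiragana letter 0x60 code points below it."""
    return "".join(
        chr(ord(ch) - KATAKANA_TO_HIRAGANA_OFFSET) if ord(ch) in KATAKANA else ch
        for ch in text
    )


def is_kana_only(text):
    """True if the text is non-empty and contains only kana, ー, or whitespace."""
    return bool(text) and all(
        ord(ch) in HIRAGANA
        or ord(ch) in KATAKANA
        or ch == PROLONGED_SOUND_MARK
        or ch.isspace()
        for ch in text
    )


print(katakana_to_hiragana("カタカナ"))
print(is_kana_only("コーヒー"), is_kana_only("漢字"))
```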

SoshyHayami avatar Nov 21 '23 23:11 SoshyHayami

@yl4579 There are two main libraries for handling Chinese text: jieba and pypinyin. jieba performs Chinese word segmentation, while pypinyin converts Chinese characters to pinyin.

pip3 install jieba pypinyin
from pypinyin import lazy_pinyin, pinyin, Style
print(pinyin('朝阳')) # [['zhāo'], ['yáng']]
print(pinyin('朝阳', heteronym=True)) # [['zhāo', 'cháo'], ['yáng']]
print(lazy_pinyin('聪明的小兔子')) # ['cong', 'ming', 'de', 'xiao', 'tu', 'zi']
print(lazy_pinyin('聪明的小兔子', style=Style.TONE3)) # ['cong1', 'ming2', 'de', 'xiao3', 'tu4', 'zi']

There are many Chinese characters, and using pinyin can greatly reduce the vocabulary size and potentially make the model smaller.

import jieba
print(list(jieba.cut('你好,我是中国人'))) # ['你好', ',', '我', '是', '中国', '人']
print(list(jieba.cut_for_search('你好,我是中国人'))) # ['你好', ',', '我', '是', '中国', '人']

With word segmentation mode, the model can learn more natural language features, but the Chinese vocabulary is very large, so the model may become very large and the compute requirements prohibitive. I highly recommend pinyin mode: the converted text looks more like English, so not much of the training code needs to change.

print(' '.join(lazy_pinyin('聪明的小兔子', style=Style.TONE3))) # 'cong1 ming2 de xiao3 tu4 zi'

mzdk100 avatar Nov 22 '23 00:11 mzdk100

If German ears are needed, I'd be happy to lend mine.

cmp-nct avatar Nov 22 '23 22:11 cmp-nct

https://github.com/rime/rime-terra-pinyin/blob/master/terra_pinyin.dict.yaml

From the industrial world, this is the characters-to-pinyin solution that the well-known input method editor Rime uses.

nicognaW avatar Nov 23 '23 03:11 nicognaW

> any help from those who speaks any language in this list (or knows some linguistics about these languages) is greatly appreciated

Keen to extend this to Malayalam, a Dravidian language spoken in South India. I will help with that.

dsplog avatar Nov 23 '23 03:11 dsplog

I hope Cantonese or Traditional Chinese is also considered when training the multilingual system; I can definitely help with this language. Is there a cooperation channel for this task?

rjrobben avatar Nov 24 '23 12:11 rjrobben

> Multilingual speech datasets are more difficult to get than language datasets. XPhoneBERT for example was trained entirely on Wikipedia in 100+ languages, but getting 100+ languages of speech data with transcriptions is more difficult. XTTS has multilingual supports but the data used seems private. I believe the creator was once interested in StyleTTS but did not proceed to integrate this into Coqui API for some reason. It would be great if he could help for multilingual supports. I will ping him to see if he is still interested.

Personally, I do not support Coqui TTS. XTTS is not open-sourced according to OSI because of its ultra-restrictive license. I believe that the future of TTS lies in open-source models such as StyleTTS.

fakerybakery avatar Nov 24 '23 22:11 fakerybakery

@rjrobben I have created a slack channel for this multilingual PL-BERT: https://join.slack.com/t/multilingualstyletts2/shared_invite/zt-2805io6cg-0ROMhjfW9Gd_ix_FJqjGmQ

yl4579 avatar Nov 24 '23 22:11 yl4579

Also, https://github.com/yl4579/PL-BERT/issues/22 may be helpful, if anyone could try it out.

yl4579 avatar Nov 24 '23 22:11 yl4579

@yl4579 Thanks for making the slack channel! Are you planning to make a slack channel for general StyleTTS 2-related discussions as well? Just because GH Discussions isn't realtime?

fakerybakery avatar Nov 24 '23 22:11 fakerybakery

@fakerybakery I can make this channel generally StyleTTS2-related if it is better. I can change the title to StyleTTS 2 instead.

yl4579 avatar Nov 24 '23 22:11 yl4579

Great, thanks! Maybe make one chatroom just about BERT instead?

fakerybakery avatar Nov 24 '23 22:11 fakerybakery

Yeah I've already done that. There's a channel about multilingual PLBERT.

yl4579 avatar Nov 24 '23 22:11 yl4579

Great! Are you planning to add the link to the README?

fakerybakery avatar Nov 24 '23 22:11 fakerybakery

It expires every 30 days. I don't know if there's a better way to get a permanent link.

yl4579 avatar Nov 24 '23 22:11 yl4579

I think there's a way to set it to never expire, right?

fakerybakery avatar Nov 24 '23 22:11 fakerybakery