Real-Time-Voice-Cloning

Support for other languages

Open yaguangtang opened this issue 5 years ago • 106 comments

Available languages

Chinese (Mandarin): #811
German: #571*
Swedish: #257*

* Requires TensorFlow 1.x (harder to set up).

Requested languages (not available yet)

Arabic: #871
Czech: #655
English: #388 (UK accent), #429 (Indian accent)
French: #854
Hindi: #525
Italian: #697
Polish: #815
Portuguese: #531
Russian: #707
Spanish: #789
Turkish: #761
Ukrainian: #492

yaguangtang avatar Jul 02 '19 05:07 yaguangtang

You'll need to retrain with your own datasets to get another language running (and it's a lot of work). The speaker encoder is somewhat able to work on a few other languages than English because VoxCeleb is not purely English, but since the synthesizer/vocoder have been trained purely on English data, any voice that is not in English - and even, that does not have a proper English accent - will be cloned very poorly.

CorentinJ avatar Jul 02 '19 22:07 CorentinJ

Thanks for the explanation. I have a strong interest in adding support for other languages and would like to contribute.

yaguangtang avatar Jul 03 '19 02:07 yaguangtang

You'll need a good dataset (at least ~300 hours, high quality and transcripts) in the language of your choice, do you have that?

CorentinJ avatar Jul 03 '19 06:07 CorentinJ

I wanna train another language. How many speakers do I need in the Encoder? or can I use the English speaker embeddings to my language?

tail95 avatar Jul 04 '19 01:07 tail95

From here:

A particularity of the SV2TTS framework is that all models can be trained separately and on distinct datasets. For the encoder, one seeks to have a model that is robust to noise and able to capture the many characteristics of the human voice. Therefore, a large corpus of many different speakers would be preferable to train the encoder, without any strong requirement on the noise level of the audios. Additionally, the encoder is trained with the GE2E loss which requires no labels other than the speaker identity. (...) For the datasets of the synthesizer and the vocoder, transcripts are required and the quality of the generated audio can only be as good as that of the data. Higher quality and annotated datasets are thus required, which often means they are smaller in size.

You'll need two datasets:

The first one should be a large dataset of untranscribed audio that can be noisy. Think thousands of speakers and thousands of hours. You can get away with a smaller one if you finetune the pretrained speaker encoder. Put maybe 1e-5 as learning rate. I'd recommend 500 speakers at the very least for finetuning. A good source for datasets of other languages is M-AILABS.
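For reference, here is a minimal PyTorch sketch of what that fine-tuning setup could look like, assuming the repo's SpeakerEncoder class and pretrained checkpoint layout (the import path and checkpoint key are assumptions and may differ from the actual code):

```python
import torch
from encoder.model import SpeakerEncoder  # assumed import path

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SpeakerEncoder(device, device)

# Start from the pretrained English encoder rather than from scratch.
checkpoint = torch.load("encoder/saved_models/pretrained.pt", map_location=device)
model.load_state_dict(checkpoint["model_state"])

# Use a much smaller learning rate (~1e-5) than for training from scratch, so the
# new-language speakers adjust the embedding space without destroying it.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```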

The second one needs audio transcripts and high quality audio. Here, finetuning won't be as effective as for the encoder, but you can get away with less data (300-500 hours). You will likely not have the alignments for that dataset, so you'll have to adapt the preprocessing procedure of the synthesizer to not split audio on silences. See the code and you'll understand what I mean.
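A hedged sketch of what "not splitting on silences" can mean in practice: treat each (audio file, transcript) pair as one whole utterance, trim only leading/trailing silence, and discard clips that are too long. Function and parameter names here are illustrative, not the repo's exact preprocessing API:

```python
import librosa

sample_rate = 16000
max_duration = 11.0  # seconds; very long clips blow up memory during synthesizer training

def preprocess_utterance(wav_path: str, transcript: str):
    """Load one whole utterance and pair it with its full transcript."""
    wav, _ = librosa.load(wav_path, sr=sample_rate)
    # Trim only leading/trailing silence; do NOT cut the utterance in the middle.
    wav, _ = librosa.effects.trim(wav, top_db=30)
    if len(wav) / sample_rate > max_duration:
        return None  # skip over-long utterances instead of splitting them
    return wav, transcript.strip()
```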

Don't start training the encoder if you don't have a dataset for the synthesizer/vocoder, you won't be able to do anything then.

CorentinJ avatar Jul 04 '19 07:07 CorentinJ

You'll need a good dataset (at least ~300 hours, high quality and transcripts) in the language of your choice, do you have that?

Maybe it can be hacked together by using audiobooks and their pdf-to-text versions. The difficulty, I guess, comes from the level of expressiveness in the data sources. Movies could also work, but their subtitles are sometimes really poor. Mozilla (Firefox) is also working on a dataset, if I remember correctly.

HumanG33k avatar Jul 05 '19 15:07 HumanG33k

You'll need a good dataset (at least ~300 hours, high quality and transcripts) in the language of your choice, do you have that?

Maybe it can be hacked together by using audiobooks and their pdf-to-text versions. The difficulty, I guess, comes from the level of expressiveness in the data sources. Movies could also work, but their subtitles are sometimes really poor. Mozilla (Firefox) is also working on a dataset, if I remember correctly.

This is something that I have been slowly piecing together. I have been gathering audiobooks and their text versions that are in the public domain (Project Gutenberg & LibriVox recordings). My goal as of now is to develop a solid package that can gather an audio file and the corresponding book, performing the necessary cleaning and so on.

Currently this project lives on my C:, but if there's interest in collaboration I'd gladly throw it here on GitHub.
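As an illustration of the cleaning step such a package would need, here is a small sketch that strips the Project Gutenberg boilerplate header/footer and normalizes whitespace before the text is paired with a LibriVox recording (the marker strings are the common ones, but some books use slightly different wording):

```python
import re

def clean_gutenberg_text(raw: str) -> str:
    start = re.search(r"\*\*\* START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", raw)
    end = re.search(r"\*\*\* END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", raw)
    body = raw[start.end():end.start()] if start and end else raw
    body = re.sub(r"\s+", " ", body)  # collapse newlines and runs of spaces
    return body.strip()
```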

zbloss avatar Jul 17 '19 03:07 zbloss

How many speakers are needed for synthesizer/vocoder training?

JasonWei512 avatar Jul 19 '19 03:07 JasonWei512

You'd want hundreds of speakers at least. In fact, LibriSpeech-clean makes for 460 speakers and it's still not enough.

CorentinJ avatar Jul 19 '19 10:07 CorentinJ

There's an open 12-hour Chinese female voice set from databaker that I tried with Tacotron: https://github.com/boltomli/tacotron/blob/zh/TRAINING_DATA.md#data-baker-data. Hope that I can gather more Chinese speakers to have a try at voice cloning. I'll update if I make some progress.

boltomli avatar Jul 19 '19 13:07 boltomli

That's not nearly enough to learn about the variations in speakers. Especially not on a hard language such as Chinese.

CorentinJ avatar Jul 19 '19 13:07 CorentinJ

@boltomli Take a look at this dataset (1,505 hours, 6,408 speakers, recorded on smartphones): https://www.datatang.com/webfront/opensource.html (Samples.zip). Not sure if the quality is good enough for encoder training.

JasonWei512 avatar Jul 20 '19 07:07 JasonWei512

You actually want the encoder dataset not to always be of good quality, because that makes the encoder robust. It's different for the synthesizer/vocoder, because their dataset quality sets the ceiling for the quality of the output.

CorentinJ avatar Jul 20 '19 09:07 CorentinJ

You'd want hundreds of speakers at least. In fact, LibriSpeech-clean makes for 460 speakers and it's still not enough.

Couldn't that be hacked around by creating new synthetic speakers with AI, like it is done for pictures?

HumanG33k avatar Jul 24 '19 08:07 HumanG33k

How about training the encoder/speaker verification model on English multi-speaker datasets, but training the synthesizer on a Chinese dataset, assuming there is enough data for each model individually?

Liujingxiu23 avatar Jul 31 '19 09:07 Liujingxiu23

You can do that, but I would then add the synthesizer dataset in the speaker encoder dataset. In SV2TTS, they use disjoint datasets between the encoder and the synthesizer, but I think it's simply to demonstrate that the speaker encoder generalizes well (the paper is presented as a transfer learning paper over a voice cloning paper after all).

There's no guarantee the speaker encoder works well on different languages than it was trained on. Considering the difficulty of generating good Chinese speech, you might want to do your best at finding really good datasets rather than hack your way around everything.

CorentinJ avatar Aug 01 '19 11:08 CorentinJ

@CorentinJ Thank you for your reply. Maybe I should find some Chinese ASR datasets to train the speaker verification model.

Liujingxiu23 avatar Aug 02 '19 01:08 Liujingxiu23

@Liujingxiu23 Have you trained a Chinese model? And could you share your model and the Chinese cloning results?

magneter avatar Aug 03 '19 16:08 magneter

@magneter I have not trained a Chinese model. I don't have enough data to train the speaker verification model; I am trying to collect suitable data now.

Liujingxiu23 avatar Aug 05 '19 01:08 Liujingxiu23

You'd want hundreds of speakers at least. In fact, LibriSpeech-clean makes for 460 speakers and it's still not enough.

@CorentinJ Hello, ignoring speakers outside the training dataset: if I only want to ensure the quality and similarity of speech synthesized for speakers that are in the training dataset (librispeech-clean), how much audio per speaker do I need for training, maybe 20 minutes or less?

xw1324832579 avatar Aug 07 '19 08:08 xw1324832579

maybe 20 minutes or less?

Wouldn't that be wonderful. You'll still need a good week or so. A few hours if you use the pretrained model. Although at this point what you're doing is no longer voice cloning, so you're not really in the right repo for that.

CorentinJ avatar Aug 07 '19 10:08 CorentinJ

This is something that I have been slowly piecing together. I have been gathering audiobooks and their text versions that are in the public domain (Project Gutenberg & LibriVox recordings). My goal as of now is to develop a solid package that can gather an audio file and the corresponding book, performing the necessary cleaning and so on.

Currently this project lives on my C:, but if there's interest in collaboration I'd gladly throw it here on GitHub.

@zbloss I'm very interested. Would you be able to upload your entire dataset somewhere? Or if it's difficult to upload, is there some way I could acquire it from you directly?

Thanks!

shawwn avatar Aug 10 '19 23:08 shawwn

@CorentinJ @yaguangtang @tail95 @zbloss @HumanG33k I am fine-tuning the encoder model with Chinese data from 3,100 speakers. I want to know how to judge whether the fine-tuning is going well. In Figure 0, the blue line is based on 2,100 speakers and the yellow line, which is being trained now, on 3,100 speakers.

Figure 1: (fine-tune 920k, from 1565k to 1610k steps, based on 2,100 speakers)

Figure 2: (fine-tune 45k, from 1565k to 1610k steps, based on 3,100 speakers)

I also want to know how many steps is enough, in general. The only way I know to judge the effect is to train the synthesizer and vocoder models one by one, but that takes a very long time. How do my EER and loss look? Looking forward to your reply!

WendongGan avatar Aug 16 '19 02:08 WendongGan

If your speakers are cleanly separated in the space (like they are in the pictures), you should be good to go! I'd be interested to compare with the same plots but before any training step was made, to see how the model does on Chinese data.
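To reproduce that kind of check, one can project utterance embeddings from a held-out set of speakers and see whether each speaker forms its own tight cluster. A minimal sketch with umap-learn, assuming `embeds` is an (N, 256) array of encoder outputs and `speaker_ids` a length-N list of labels (both names are placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np
import umap

def plot_speaker_clusters(embeds: np.ndarray, speaker_ids: list):
    # 2-D projection of the embeddings; cosine distance matches how the encoder is used.
    projected = umap.UMAP(n_neighbors=15, metric="cosine").fit_transform(embeds)
    for speaker in sorted(set(speaker_ids)):
        mask = np.array([s == speaker for s in speaker_ids])
        plt.scatter(projected[mask, 0], projected[mask, 1], s=6, label=str(speaker))
    plt.title("Speaker embedding projection (tight, well-separated clusters = good)")
    plt.show()
```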

CorentinJ avatar Aug 16 '19 09:08 CorentinJ

Hey guys, does anyone have pretrained models for Chinese? I need one for my college project.

Jessicamat777 avatar Sep 16 '19 16:09 Jessicamat777

Will it work on recorded phone calls? If it works, I can provide millions of hours of recordings in the Bengali language.

MuzahidGithub avatar Sep 16 '19 19:09 MuzahidGithub

An interesting feature would be the ability to go through YouTube playlists with hundreds of videos in a specified language: grab the audio stream with something like the youtube-dl project into a temp folder, use it for training, and repeat with each video. I'll try to see if that is possible. Amazing work!
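A rough sketch of that download step with the youtube-dl Python package, extracting only the audio of each playlist video to WAV in a temp folder (the playlist URL is a placeholder, and ffmpeg must be installed for the conversion):

```python
import youtube_dl

options = {
    "format": "bestaudio/best",
    "outtmpl": "tmp_audio/%(id)s.%(ext)s",  # one file per video in a temp folder
    "postprocessors": [{
        "key": "FFmpegExtractAudio",
        "preferredcodec": "wav",
    }],
    "ignoreerrors": True,  # keep going if one video in the playlist fails
}

with youtube_dl.YoutubeDL(options) as ydl:
    ydl.download(["https://www.youtube.com/playlist?list=PLACEHOLDER"])
```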

alabastida avatar Sep 19 '19 03:09 alabastida

Can I continue training in Chinese corpus using the pre-training model you provided? @CorentinJ

liuzongquan avatar Sep 20 '19 06:09 liuzongquan

For the encoder yes, for the synthesizer I wouldn't recommend it. For the vocoder, probably.

CorentinJ avatar Sep 20 '19 09:09 CorentinJ

Got it. Thanks a lot~

liuzongquan avatar Sep 20 '19 11:09 liuzongquan