
Is it able to train on a Chinese dataset?

lucasjinreal opened this issue 3 years ago • 50 comments

Is it able to train on a Chinese dataset?

lucasjinreal avatar Jun 11 '21 06:06 lucasjinreal

Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes. This model takes phonemes as input rather than characters.

jaywalnut310 avatar Jun 11 '21 06:06 jaywalnut310

@jaywalnut310 I would probably try the Biaobei data for Chinese, though I'm a total newbie in TTS. Let me take a deeper look. What would phonemes look like in Chinese?

lucasjinreal avatar Jun 11 '21 06:06 lucasjinreal

@jaywalnut310 I would probably try the Biaobei data for Chinese, though I'm a total newbie in TTS. Let me take a deeper look. What would phonemes look like in Chinese?

The phonemes in Chinese are initials and finals with tones; for example, "ni2 hao3" can be converted into "n i2 h ao3".
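That scheme can be sketched in plain Python. The splitter below is a hypothetical illustration (the `INITIALS` list and function names are mine, not from any repo):

```python
# Hypothetical sketch (not from the VITS repo): split toned-pinyin syllables
# like "ni2 hao3" into initial + final-with-tone tokens, the phoneme scheme
# described above.

# Mandarin initials, multi-letter ones first so "zh"/"ch"/"sh" win over "z"/"c"/"s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syl):
    """Split one toned pinyin syllable, e.g. 'hao3' -> ['h', 'ao3']."""
    for ini in INITIALS:
        if syl.startswith(ini) and len(syl) > len(ini):
            return [ini, syl[len(ini):]]
    return [syl]  # zero-initial syllables such as 'a1' or 'er2'

def text_to_phonemes(text):
    return " ".join(tok for syl in text.split() for tok in split_syllable(syl))

print(text_to_phonemes("ni2 hao3"))  # n i2 h ao3
```

A production setup would instead use a library such as pypinyin or Phonemizer, which also handle grapheme-to-pinyin conversion and edge cases.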

LG-SS avatar Jun 11 '21 09:06 LG-SS

Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes. This model takes phonemes as input rather than characters.

Curiously, has the paper been released publicly? I haven't been able to find it on arXiv or Google Scholar yet.

LG-SS avatar Jun 11 '21 09:06 LG-SS

@LG-SS Now the paper is available: https://arxiv.org/abs/2106.06103

Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes. This model takes phonemes as input rather than characters.

Curiously, has the paper been released publicly? I haven't been able to find it on arXiv or Google Scholar yet.

jaywalnut310 avatar Jun 14 '21 01:06 jaywalnut310

@jaywalnut310 Hi, may I ask one last question: how does the latency compare with Tacotron 2 (I mean end-to-end latency; Tacotron 2 also needs a vocoder, which should be counted in)? Is VITS faster or slower when predicting a sentence of the same length? By how much?

lucasjinreal avatar Jun 15 '21 12:06 lucasjinreal

@jaywalnut310 Is this model autoregressive or non-autoregressive?

leminhnguyen avatar Jun 15 '21 16:06 leminhnguyen

@jaywalnut310 Hi, may I ask one last question: how does the latency compare with Tacotron 2 (I mean end-to-end latency; Tacotron 2 also needs a vocoder, which should be counted in)? Is VITS faster or slower when predicting a sentence of the same length? By how much?

Hi @jinfagang. In a previous work, Glow-TTS, a synthesis speed comparison between Tacotron 2 and Glow-TTS was reported. Since the synthesis speed of VITS is faster than that of Glow-TTS + HiFi-GAN (vocoder), it should be much faster than that of Tacotron 2 + HiFi-GAN (vocoder).

[image: synthesis speed comparison table from the Glow-TTS paper]
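For anyone who wants to reproduce such numbers on their own hardware, end-to-end latency can be measured with a simple wall-clock benchmark. This is only a sketch; `synthesize` below is a stub standing in for a real text-to-waveform pipeline (e.g. VITS, or Tacotron 2 + HiFi-GAN):

```python
import time

def synthesize(text):
    """Stub: pretend to emit 256 samples per input character."""
    return [0.0] * (len(text) * 256)

def mean_latency(fn, text, runs=10):
    """Average seconds per call of fn(text) over `runs` runs."""
    t0 = time.perf_counter()
    for _ in range(runs):
        fn(text)
    return (time.perf_counter() - t0) / runs

print(f"{mean_latency(synthesize, 'ni2 hao3'):.6f} s per utterance")
```

Swapping the stub for a real pipeline and a fixed set of test sentences gives a fair apples-to-apples comparison, provided both systems run on the same device.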

jaywalnut310 avatar Jun 15 '21 23:06 jaywalnut310

@jaywalnut310 Is this model autoregressive or non-autoregressive?

Hi @leminhnguyen, this model is non-autoregressive.
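The structural difference can be sketched with toy decoders (illustrative only, not real models): an autoregressive decoder takes T sequential steps because each frame is conditioned on the previous one, while a non-autoregressive model like VITS emits all frames in one parallel pass.

```python
def autoregressive_decode(length):
    # Tacotron 2-style: frame t depends on frame t-1,
    # so generation takes `length` sequential steps.
    frames = [0]
    for _ in range(1, length):
        frames.append(frames[-1] + 1)
    return frames

def non_autoregressive_decode(length):
    # VITS-style: all frames come from one parallel pass over the latents.
    return list(range(length))

print(non_autoregressive_decode(5))  # [0, 1, 2, 3, 4]
```

In practice this is why non-autoregressive models scale much better with utterance length on parallel hardware.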

jaywalnut310 avatar Jun 15 '21 23:06 jaywalnut310

@jaywalnut310 Thank you. I have some questions.

  1. How about controllability?
  2. Can we change the duration, energy, or pitch?
  3. In the paper, you mention FastSpeech 2 in related work. Did you compare the speed of FastSpeech 2 and VITS?

leminhnguyen avatar Jun 16 '21 05:06 leminhnguyen

@jaywalnut310 I listened to the sample audio from VITS; it's much better and more natural than Tacotron 2, so it's both better and faster, well worth a try. Do you have a Chinese pretrained model, by the way?

lucasjinreal avatar Jun 16 '21 06:06 lucasjinreal

@leminhnguyen Well, VITS provides controllability to some extent. You can control and change the duration manually. You can control the energy and pitch by manipulating the latent representation (z in our code), but you cannot predict beforehand how much the energy and pitch will change. And I only compared quality against open-sourced official implementations (unfortunately, FastSpeech 2 isn't one).
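The duration control mentioned above boils down to scaling the predicted per-phoneme durations. A toy illustration (made-up numbers; this is not the actual VITS code):

```python
import math

# The duration predictor emits frames-per-phoneme; inference scales them by
# length_scale, so length_scale > 1 slows speech down and < 1 speeds it up.
durations = [3.0, 5.0, 4.0]   # predicted frames for three phonemes
length_scale = 1.5
frames = [math.ceil(d * length_scale) for d in durations]
print(frames, sum(frames))    # [5, 8, 6] 19
```

In the VITS repo this corresponds to the `length_scale` argument of the inference method.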

@jinfagang Thank you :). I haven't trained on a Chinese dataset, but it would be great if someone tries it and shares the result later.

jaywalnut310 avatar Jun 16 '21 22:06 jaywalnut310

@jaywalnut310 I can train on the BIAOBEI dataset, which is an open-source Chinese dataset. But can you tell me how I should organize it?

lucasjinreal avatar Jun 17 '21 03:06 lucasjinreal

Well, VITS provides controllability to some extent. You can control and change the duration manually. You can control the energy and pitch by manipulating the latent representation (z in our code), but you cannot predict beforehand how much the energy and pitch will change. And I only compared quality against open-sourced official implementations (unfortunately, FastSpeech 2 isn't one).

@jaywalnut310 I am only familiar with Tacotron and have not yet used a model with variability. What parameters should I change in the inference code to change duration or pitch? Or does this need to be done during training?

TaoTeCha avatar Jun 21 '21 17:06 TaoTeCha

@jinfagang @TaoTeCha Have you trained a Chinese model successfully? Also, are you planning to open-source the model?

WadoodAbdul avatar Sep 20 '21 06:09 WadoodAbdul

vits_Chinese.zip
I was surprised to find that VITS has no limit on phoneme length. Amazing!

MaxMax2016 avatar Sep 26 '21 07:09 MaxMax2016

@dtx525942103 That's amazing! It can synthesize such long speech! Do you plan to open-source your training code for Chinese?

lucasjinreal avatar Sep 27 '21 02:09 lucasjinreal

vits_Chinese.zip
I was surprised to find that VITS has no limit on phoneme length. Amazing!

@dtx525942103 That's great! May I ask which dataset you trained on? And roughly how much audio is needed to extract a new Chinese voice? Thank you very much!

hemath1001 avatar Jan 18 '22 03:01 hemath1001

I used the DB1 dataset; it contains 10,000 sentences.

MaxMax2016 avatar Jan 18 '22 03:01 MaxMax2016

I used the DB1 dataset; it contains 10,000 sentences.

@dtx525942103 Thanks for the reply! Could you tell me the full name of the dataset? I couldn't find anything under that abbreviation T_T. Is it DataBaker?

hemath1001 avatar Jan 18 '22 03:01 hemath1001

@dtx525942103 Same question here.

lucasjinreal avatar Jan 18 '22 07:01 lucasjinreal

weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar ​​​

Much appreciated, thank you!

hemath1001 avatar Jan 18 '22 07:01 hemath1001

I have trained for about 1000 epochs. It's not fully trained, but the results seem impressive.

I uploaded several Mandarin examples for anyone interested: 中文语音合成实例.zip

lucasjinreal avatar Jan 26 '22 07:01 lucasjinreal

@hemath1001 The Chinese model: https://github.com/dtx525942103/vits_chinese/issues/3

MaxMax2016 avatar Feb 07 '22 04:02 MaxMax2016

Is it able to train on a Chinese dataset?

Hello, can you tell me why this error occurs when using the phonemizer function with the backend parameter set to espeak? (RuntimeError: failed to find espeak library) I would like to know how to install espeak. Thank you!
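For reference, a common fix for that error is installing the espeak-ng backend system-wide and, if needed, pointing phonemizer at the shared library via its `PHONEMIZER_ESPEAK_LIBRARY` environment variable. The package name and library path below are assumptions about a typical Debian/Ubuntu setup; adjust them for your system:

```shell
# Install the espeak-ng backend for phonemizer (Debian/Ubuntu shown;
# on macOS: brew install espeak-ng):
#   sudo apt-get install espeak-ng
# If phonemizer still raises "failed to find espeak library", point it at
# the shared library explicitly (example path; adjust to your system):
export PHONEMIZER_ESPEAK_LIBRARY=/usr/lib/x86_64-linux-gnu/libespeak-ng.so.1
```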

yuyu122 avatar Mar 11 '22 08:03 yuyu122

@dtx525942103 Hello, your trained model sounds excellent. Did you set add_blank=True during training?

yt605155624 avatar Jun 06 '22 07:06 yt605155624

Yes, you have to add this argument in the config file.
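For context, this flag lives in the `data` section of the VITS JSON configs (e.g. `configs/ljs_base.json`); a minimal fragment:

```json
{
  "data": {
    "add_blank": true
  }
}
```

With `add_blank` enabled, a blank token is interspersed between input phonemes, which the model expects at both training and inference time.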

I provide a Chinese example model in this repo: https://github.com/wac81/vits_chinese

wac81 avatar Jun 15 '22 02:06 wac81

vits_Chinese.zip
I was surprised to find that VITS has no limit on phoneme length. Amazing!

I'd really like to ask: why does the posterior encoder use the linear spectrogram instead of the mel spectrogram directly? As I read the paper, the mel reconstruction loss is also computed on mel spectrograms.

wgc7998 avatar Aug 05 '22 07:08 wgc7998

vits_Chinese.zip
I was surprised to find that VITS has no limit on phoneme length. Amazing!

I'd really like to ask: why does the posterior encoder use the linear spectrogram instead of the mel spectrogram directly? As I read the paper, the mel reconstruction loss is also computed on mel spectrograms.

As stated in the paper, using the linear spectrogram gives better results than using the mel spectrogram.
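One intuition for this: the mel spectrogram is a lossy linear projection of the linear spectrogram (mel = mel_basis @ linear, with n_mels much smaller than the number of frequency bins), so the linear input gives the posterior encoder strictly more information. A toy sketch with random stand-ins (a real `mel_basis` would come from something like `librosa.filters.mel`):

```python
import random

n_freq, n_mels, n_frames = 513, 80, 100
# Random stand-ins for the mel filterbank and a magnitude spectrogram.
mel_basis = [[random.random() for _ in range(n_freq)] for _ in range(n_mels)]
linear = [[random.random() for _ in range(n_frames)] for _ in range(n_freq)]

# mel = mel_basis @ linear: 513 frequency bins collapse into 80 mel bands.
mel = [[sum(mel_basis[m][f] * linear[f][t] for f in range(n_freq))
        for t in range(n_frames)]
       for m in range(n_mels)]

print(len(mel), len(mel[0]))  # 80 100
```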

MaxMax2016 avatar Aug 08 '22 03:08 MaxMax2016

Is your code no longer open source? I can't find it anymore.

sixyang avatar Aug 17 '22 17:08 sixyang

@jaywalnut310 @TaoTeCha You said that we can control and change the energy and pitch by manipulating the latent representation (z in the code). Can you specify how? I mean, which values of z affect energy, pitch, etc.?

tuannvhust avatar Oct 26 '22 04:10 tuannvhust