
Is it able to train on a Chinese dataset?

lucasjinreal opened this issue 3 years ago • 50 comments

Is it able to train on a Chinese dataset?

lucasjinreal avatar Jun 11 '21 06:06 lucasjinreal

Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes. This model takes phonemes as input rather than characters.

jaywalnut310 avatar Jun 11 '21 06:06 jaywalnut310

@jaywalnut310 I would probably try the Biaobei data for Chinese, though I'm a total newbie in TTS. Let me take a deeper look. What would phonemes look like in Chinese?

lucasjinreal avatar Jun 11 '21 06:06 lucasjinreal

@jaywalnut310 I would probably try the Biaobei data for Chinese, though I'm a total newbie in TTS. Let me take a deeper look. What would phonemes look like in Chinese?

The phonemes in Chinese are initials and finals with tones; for example, "ni2 hao3" can be converted into "n i2 h ao3".
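That scheme can be sketched in plain Python. The splitter below is a hypothetical illustration (the `INITIALS` list and function names are mine, not from any repo):

```python
# Hypothetical sketch (not from the VITS repo): split toned-pinyin syllables
# like "ni2 hao3" into initial + final-with-tone tokens, the phoneme scheme
# described above.

# Mandarin initials, multi-letter ones first so "zh"/"ch"/"sh" win over "z"/"c"/"s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syl):
    """Split one toned pinyin syllable, e.g. 'hao3' -> ['h', 'ao3']."""
    for ini in INITIALS:
        if syl.startswith(ini) and len(syl) > len(ini):
            return [ini, syl[len(ini):]]
    return [syl]  # zero-initial syllables such as 'a1' or 'er2'

def text_to_phonemes(text):
    return " ".join(tok for syl in text.split() for tok in split_syllable(syl))

print(text_to_phonemes("ni2 hao3"))  # n i2 h ao3
```

A production setup would instead use a library such as pypinyin or Phonemizer, which also handle grapheme-to-pinyin conversion and edge cases.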

LG-SS avatar Jun 11 '21 09:06 LG-SS

Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes. This model takes phonemes as input rather than characters.

Curiously, has the paper been released publicly? I haven't been able to find it on arXiv or Google Scholar yet.

LG-SS avatar Jun 11 '21 09:06 LG-SS

@LG-SS Now the paper is available: https://arxiv.org/abs/2106.06103

Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes. This model takes phonemes as input rather than characters.

Curiously, has the paper been released publicly? I haven't been able to find it on arXiv or Google Scholar yet.

jaywalnut310 avatar Jun 14 '21 01:06 jaywalnut310

@jaywalnut310 Hi, may I ask one last question: how does the latency compare with Tacotron 2 (I mean end-to-end latency; Tacotron 2 also needs a vocoder, which should be counted in)? Is VITS faster or slower when predicting a sentence of the same length? By how much?

lucasjinreal avatar Jun 15 '21 12:06 lucasjinreal

@jaywalnut310 Is this model autoregressive or non-autoregressive?

leminhnguyen avatar Jun 15 '21 16:06 leminhnguyen

@jaywalnut310 Hi, may I ask one last question: how does the latency compare with Tacotron 2 (I mean end-to-end latency; Tacotron 2 also needs a vocoder, which should be counted in)? Is VITS faster or slower when predicting a sentence of the same length? By how much?

Hi @jinfagang. In a previous work, Glow-TTS, a synthesis speed comparison between Tacotron 2 and Glow-TTS was reported. Since the synthesis speed of VITS is faster than that of Glow-TTS + HiFi-GAN (vocoder), it should be much faster than that of Tacotron 2 + HiFi-GAN (vocoder).

[image: synthesis speed comparison table from the Glow-TTS paper]
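For anyone who wants to reproduce such numbers on their own hardware, end-to-end latency can be measured with a simple wall-clock benchmark. This is only a sketch; `synthesize` below is a stub standing in for a real text-to-waveform pipeline (e.g. VITS, or Tacotron 2 + HiFi-GAN):

```python
import time

def synthesize(text):
    """Stub: pretend to emit 256 samples per input character."""
    return [0.0] * (len(text) * 256)

def mean_latency(fn, text, runs=10):
    """Average seconds per call of fn(text) over `runs` runs."""
    t0 = time.perf_counter()
    for _ in range(runs):
        fn(text)
    return (time.perf_counter() - t0) / runs

print(f"{mean_latency(synthesize, 'ni2 hao3'):.6f} s per utterance")
```

Swapping the stub for a real pipeline and a fixed set of test sentences gives a fair apples-to-apples comparison, provided both systems run on the same device.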

jaywalnut310 avatar Jun 15 '21 23:06 jaywalnut310

@jaywalnut310 Is this model autoregressive or non-autoregressive?

Hi @leminhnguyen, this model is non-autoregressive.
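The structural difference can be sketched with toy decoders (illustrative only, not real models): an autoregressive decoder takes T sequential steps because each frame is conditioned on the previous one, while a non-autoregressive model like VITS emits all frames in one parallel pass.

```python
def autoregressive_decode(length):
    # Tacotron 2-style: frame t depends on frame t-1,
    # so generation takes `length` sequential steps.
    frames = [0]
    for _ in range(1, length):
        frames.append(frames[-1] + 1)
    return frames

def non_autoregressive_decode(length):
    # VITS-style: all frames come from one parallel pass over the latents.
    return list(range(length))

print(non_autoregressive_decode(5))  # [0, 1, 2, 3, 4]
```

In practice this is why non-autoregressive models scale much better with utterance length on parallel hardware.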

jaywalnut310 avatar Jun 15 '21 23:06 jaywalnut310

@jaywalnut310 Thank you. I have some questions.

  1. How about controllability?
  2. Can we change the duration, energy, or pitch?
  3. In the paper, you mention FastSpeech 2 in related work. Did you compare the speed of FastSpeech 2 and VITS?

leminhnguyen avatar Jun 16 '21 05:06 leminhnguyen

@jaywalnut310 I listened to the sample audio from VITS; it's much better and more natural than Tacotron 2, so it's both better and faster, well worth a try. Do you have a Chinese pretrained model, by the way?

lucasjinreal avatar Jun 16 '21 06:06 lucasjinreal

@leminhnguyen Well, VITS provides controllability to some extent. You can control and change the duration manually. You can control the energy and pitch by manipulating the latent representation (z in our code), but you cannot predict beforehand how much the energy and pitch will change. And I only compared quality against open-sourced official implementations (unfortunately, FastSpeech 2 isn't one).
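The duration control mentioned above boils down to scaling the predicted per-phoneme durations. A toy illustration (made-up numbers; this is not the actual VITS code):

```python
import math

# The duration predictor emits frames-per-phoneme; inference scales them by
# length_scale, so length_scale > 1 slows speech down and < 1 speeds it up.
durations = [3.0, 5.0, 4.0]   # predicted frames for three phonemes
length_scale = 1.5
frames = [math.ceil(d * length_scale) for d in durations]
print(frames, sum(frames))    # [5, 8, 6] 19
```

In the VITS repo this corresponds to the `length_scale` argument of the inference method.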

@jinfagang Thank you :). I haven't trained on a Chinese dataset, but it would be great if someone tries it and shares the result later.

jaywalnut310 avatar Jun 16 '21 22:06 jaywalnut310

@jaywalnut310 I can train on the BIAOBEI dataset, which is an open-source Chinese dataset. But can you tell me how I should organize it?

lucasjinreal avatar Jun 17 '21 03:06 lucasjinreal

Well, VITS provides controllability to some extent. You can control and change the duration manually. You can control the energy and pitch by manipulating the latent representation (z in our code), but you cannot predict beforehand how much the energy and pitch will change. And I only compared quality against open-sourced official implementations (unfortunately, FastSpeech 2 isn't one).

@jaywalnut310 I am only familiar with Tacotron and have not yet used a model with variability. What parameters should I change in the inference code to change duration or pitch? Or does this need to be done during training?

TaoTeCha avatar Jun 21 '21 17:06 TaoTeCha

@jinfagang @TaoTeCha Have you trained a Chinese model successfully? Also, are you planning to open-source the model?

WadoodAbdul avatar Sep 20 '21 06:09 WadoodAbdul

vits_Chinese.zip
I was surprised to find that VITS has no limit on phoneme length. Amazing!

MaxMax2016 avatar Sep 26 '21 07:09 MaxMax2016

@dtx525942103 That's amazing! It can synthesize such long speech! Do you plan to open-source your training code for Chinese?

lucasjinreal avatar Sep 27 '21 02:09 lucasjinreal

vits_Chinese.zip
I was surprised to find that VITS has no limit on phoneme length. Amazing!

@dtx525942103 That's great! May I ask which dataset you trained on? And roughly how much audio is needed to extract a new Chinese voice? Thank you very much!

hemath1001 avatar Jan 18 '22 03:01 hemath1001

I used the DB1 dataset; it contains 10,000 sentences.

MaxMax2016 avatar Jan 18 '22 03:01 MaxMax2016

I used the DB1 dataset; it contains 10,000 sentences.

@dtx525942103 Thanks for the reply! Could you tell me the full name of the dataset? I couldn't find anything under that abbreviation T_T. Is it DataBaker?

hemath1001 avatar Jan 18 '22 03:01 hemath1001

@dtx525942103 Same question here.

lucasjinreal avatar Jan 18 '22 07:01 lucasjinreal

weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar ​​​

Much appreciated, thank you!

hemath1001 avatar Jan 18 '22 07:01 hemath1001

I have trained for about 1000 epochs. It's not fully trained, but the results seem impressive.

I uploaded several Mandarin examples for anyone interested: 中文语音合成实例.zip

lucasjinreal avatar Jan 26 '22 07:01 lucasjinreal

@hemath1001 The Chinese model: https://github.com/dtx525942103/vits_chinese/issues/3

MaxMax2016 avatar Feb 07 '22 04:02 MaxMax2016

Is it able to train on a Chinese dataset?

Hello, can you tell me why this error occurs when using the phonemizer function with the backend parameter set to espeak? (RuntimeError: failed to find espeak library) I would like to know how to install espeak. Thank you!
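For reference, a common fix for that error is installing the espeak-ng backend system-wide and, if needed, pointing phonemizer at the shared library via its `PHONEMIZER_ESPEAK_LIBRARY` environment variable. The package name and library path below are assumptions about a typical Debian/Ubuntu setup; adjust them for your system:

```shell
# Install the espeak-ng backend for phonemizer (Debian/Ubuntu shown;
# on macOS: brew install espeak-ng):
#   sudo apt-get install espeak-ng
# If phonemizer still raises "failed to find espeak library", point it at
# the shared library explicitly (example path; adjust to your system):
export PHONEMIZER_ESPEAK_LIBRARY=/usr/lib/x86_64-linux-gnu/libespeak-ng.so.1
```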

yuyu122 avatar Mar 11 '22 08:03 yuyu122

@dtx525942103 Hello, your trained model sounds excellent. Did you set add_blank=True during training?

yt605155624 avatar Jun 06 '22 07:06 yt605155624

Yes, you have to add this argument in the config file.
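For context, this flag lives in the `data` section of the VITS JSON configs (e.g. `configs/ljs_base.json`); a minimal fragment:

```json
{
  "data": {
    "add_blank": true
  }
}
```

With `add_blank` enabled, a blank token is interspersed between input phonemes, which the model expects at both training and inference time.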

I provide a Chinese example model in this repo: https://github.com/wac81/vits_chinese

wac81 avatar Jun 15 '22 02:06 wac81

vits_Chinese.zip
I was surprised to find that VITS has no limit on phoneme length. Amazing!

I'd really like to ask: why does the posterior encoder use the linear spectrogram instead of the mel spectrogram directly? As I read the paper, the mel reconstruction loss is also computed on mel spectrograms.

wgc7998 avatar Aug 05 '22 07:08 wgc7998

vits_Chinese.zip
I was surprised to find that VITS has no limit on phoneme length. Amazing!

I'd really like to ask: why does the posterior encoder use the linear spectrogram instead of the mel spectrogram directly? As I read the paper, the mel reconstruction loss is also computed on mel spectrograms.

As stated in the paper, using the linear spectrogram gives better results than using the mel spectrogram.
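One intuition for this: the mel spectrogram is a lossy linear projection of the linear spectrogram (mel = mel_basis @ linear, with n_mels much smaller than the number of frequency bins), so the linear input gives the posterior encoder strictly more information. A toy sketch with random stand-ins (a real `mel_basis` would come from something like `librosa.filters.mel`):

```python
import random

n_freq, n_mels, n_frames = 513, 80, 100
# Random stand-ins for the mel filterbank and a magnitude spectrogram.
mel_basis = [[random.random() for _ in range(n_freq)] for _ in range(n_mels)]
linear = [[random.random() for _ in range(n_frames)] for _ in range(n_freq)]

# mel = mel_basis @ linear: 513 frequency bins collapse into 80 mel bands.
mel = [[sum(mel_basis[m][f] * linear[f][t] for f in range(n_freq))
        for t in range(n_frames)]
       for m in range(n_mels)]

print(len(mel), len(mel[0]))  # 80 100
```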

MaxMax2016 avatar Aug 08 '22 03:08 MaxMax2016

Is your code no longer open source? I can't find it anymore.

sixyang avatar Aug 17 '22 17:08 sixyang

@jaywalnut310 @TaoTeCha You said that we can control and change the energy and pitch by manipulating the latent representation (z in the code). Can you specify how? I mean, which values of z affect energy, pitch, etc.?

tuannvhust avatar Oct 26 '22 04:10 tuannvhust