Is this able to train on a Chinese dataset?
Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes. This model takes phonemes as input rather than characters.
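For context, this is roughly how the English cleaners in this repo's text/cleaners.py call Phonemizer (the arguments shown are illustrative, and the printed output is an example, not guaranteed verbatim):

```python
# Sketch: text-to-phoneme conversion with Phonemizer's espeak backend,
# similar in spirit to the cleaners in text/cleaners.py.
from phonemizer import phonemize

phones = phonemize("Hello world", language="en-us", backend="espeak",
                   strip=True, preserve_punctuation=True, with_stress=True)
print(phones)  # e.g. "həlˈoʊ wˈɜːld"
```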
@jaywalnut310 I would probably try the Biaobei dataset for Chinese, though I am a total newbie in TTS. Let me take a deep look. What would phonemes look like in Chinese?
The phonemes in Chinese are initials and finals with tones; for example, "ni2 hao3" can be converted into "n i2 h ao3".
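For anyone following along, here is a minimal sketch of that conversion using the pypinyin library (pypinyin is my assumption, not something the commenters named; note it does not apply tone sandhi by default, so 你好 comes out with tone 3 on both syllables):

```python
# Sketch: split Chinese text into initials and finals with tone numbers.
from pypinyin import pinyin, Style

def chinese_to_phonemes(text):
    initials = pinyin(text, style=Style.INITIALS, strict=False)
    finals = pinyin(text, style=Style.FINALS_TONE3, strict=False)
    phones = []
    for (ini,), (fin,) in zip(initials, finals):
        if ini:                  # some syllables (e.g. "ai") have no initial
            phones.append(ini)
        phones.append(fin)
    return " ".join(phones)

print(chinese_to_phonemes("你好"))  # -> n i3 h ao3
```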
Curiously, has the paper been released publicly? I have not found it on arXiv or Google Scholar so far.
@LG-SS Now the paper is available: https://arxiv.org/abs/2106.06103
@jaywalnut310 Hi, may I ask one last question: how does the latency compare with Tacotron 2? (I mean end-to-end latency; Tacotron 2 also needs a vocoder, which should be counted in.) Is VITS faster or slower at predicting a sentence of the same length, and by how much?
@jaywalnut310 Is this model autoregressive or non-autoregressive?
Hi @jinfagang. In a previous work, Glow-TTS, a synthesis speed comparison between Tacotron 2 and Glow-TTS was reported. As VITS synthesizes faster than Glow-TTS + HiFi-GAN (vocoder), it should be much faster than Tacotron 2 + HiFi-GAN (vocoder).
Hi @leminhnguyen, this model is non-autoregressive.
@jaywalnut310 Thank you, I have some questions.
- How about controllability?
- Can we change the duration, energy, or pitch?
- In the paper, you mentioned FastSpeech 2 in related work. Did you try to compare the speed of FastSpeech 2 and VITS?
@jaywalnut310 I listened to the sample audio from VITS; it's much better and more natural than Tacotron 2. Being both better and faster, it's well worth a try. Do you have a Chinese pretrained model, by the way?
@leminhnguyen Well, VITS provides controllability to some extent. You can change the duration manually, and you can change the energy and pitch by manipulating the latent representation (z in our code), but you cannot predict beforehand how much the energy and pitch will change. Also, I only compared quality against official open-source implementations (unfortunately, FastSpeech 2 has none).
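For reference, a minimal sketch of the knobs exposed at inference time, following this repo's inference.ipynb (net_g, x_tst, and x_tst_lengths are assumed to be a loaded SynthesizerTrn and its phoneme-ID inputs; the values are illustrative):

```python
import torch

# Assumes net_g, x_tst, and x_tst_lengths are set up as in inference.ipynb.
with torch.no_grad():
    audio = net_g.infer(
        x_tst, x_tst_lengths,
        noise_scale=0.667,   # std of the sampled latent z: prosodic variation
        noise_scale_w=0.8,   # std of the stochastic duration predictor
        length_scale=1.2,    # >1.0 slows speech down, <1.0 speeds it up
    )[0][0, 0].data.cpu().float().numpy()
```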
@jinfagang Thank you :). I haven't trained on a Chinese dataset, but it would be great if someone tries it and shares the result later.
@jaywalnut310 I can train on the BIAOBEI dataset, which is an open-source Chinese dataset. But can you tell me how I should organize it?
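In case it helps, the repo's LJSpeech filelists (see filelists/) use one utterance per line in `wav_path|text` format; a Biaobei filelist might look like the sketch below, where the paths are hypothetical and the transcripts are pinyin with tone numbers:

```
DUMMY/Wave/000001.wav|ka2 er3 pu3 pei2 wai4 sun1 wan2 hua2 ti1
DUMMY/Wave/000002.wav|jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3
```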
@jaywalnut310 I am only familiar with Tacotron, and have not yet used a model with variability. What parameters should I change in the inference code to change duration or pitch? Or are you saying this needs to be done during training?
@jinfagang @TaoTeCha Have you trained a Chinese model successfully? Also, are you planning to open-source the model?
vits_Chinese.zip
I was surprised to find that VITS has no limit on phoneme length. So amazing!
@dtx525942103 That's amazing! It can synthesize such long utterances! Do you plan to open-source your code for training on Chinese?
@dtx525942103 That's great! May I ask which dataset you trained on? And roughly how much audio is needed to extract a new Chinese voice? Thank you very, very much!
I used the DB1 dataset; it has 10,000 sentences.
@dtx525942103 Thanks for the reply~ Could you tell me the dataset's full name? I couldn't find anything under that abbreviation T_T. Is it DataBaker?
@dtx525942103 Same request here.
weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar
A grateful heart, thankful to have you~~
I have trained for about 1000 epochs; it's not fully trained, but the result seems impressive.
I uploaded several Mandarin examples, for anyone interested: 中文语音合成实例.zip
@hemath1001 The Chinese model: https://github.com/dtx525942103/vits_chinese/issues/3
Hello, did you hit this error when using the phonemizer function with the backend parameter set to espeak? (RuntimeError: failed to find espeak library) I would like to know how to install espeak. Thank you!
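In case it helps, a minimal check, assuming espeak-ng is installed system-wide (e.g. `apt-get install espeak-ng` on Debian/Ubuntu):

```python
# Sketch: verify that Phonemizer can find the espeak backend.
# If the RuntimeError persists, set the PHONEMIZER_ESPEAK_LIBRARY
# environment variable to the full path of the libespeak-ng shared library.
from phonemizer import phonemize

print(phonemize("hello world", language="en-us", backend="espeak"))
```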
@dtx525942103 Hello, your training results are excellent. May I ask whether you set add_blank=True during training?
Yes, you have to add this argument in the config file.
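For reference, in this repo's configs (e.g. configs/ljs_base.json) the flag sits under the "data" section; when it is true, a blank token is interspersed between phoneme IDs at load time. A minimal excerpt (other data fields omitted):

```json
{
  "data": {
    "add_blank": true
  }
}
```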
I provide a Chinese example model in this repo: https://github.com/wac81/vits_chinese
I'd really like to ask: why does the posterior encoder use the linear spectrogram rather than the mel spectrogram directly? From what I see in the paper, the mel reconstruction loss is also computed on mel spectrograms.
As the paper says, using linear spectrograms gives better results than using mel spectrograms.
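To make the resolution difference concrete, here is a small sketch with VITS-style settings (22.05 kHz audio, n_fft 1024, hop 256; torchaudio's MelScale stands in for the repo's own mel_processing code): the posterior encoder sees all 513 linear-frequency bins, while the mel projection used for the reconstruction loss compresses them to 80.

```python
import torch
import torchaudio

wav = torch.randn(1, 22050)                    # dummy 1-second waveform
n_fft, hop = 1024, 256
spec = torch.stft(wav, n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft),
                  return_complex=True).abs()   # linear spectrogram
mel = torchaudio.transforms.MelScale(
    n_mels=80, sample_rate=22050,
    n_stft=n_fft // 2 + 1)(spec)               # 80-bin mel projection
print(spec.shape, mel.shape)                   # 513 freq bins vs. 80
```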
Is your code no longer open-sourced? I can't find it anymore~
@jaywalnut310 @TaoTeCha You said that we can control and change the energy and pitch by manipulating the latent representation (z in the code). Can you specify how? I mean, which values of z affect energy, pitch, etc.?