Jaehyeon Kim

16 comments by Jaehyeon Kim

Hi @snakers4. We reported inference speed tests on a GPU server rather than in CPU-only environments, as that is the representative setting for speed comparison in many papers. I think...
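
For reference, here's a minimal sketch of how such a GPU-side timing can be done; `model` and `inputs` are placeholders for your network and its input tensors, not names from this repo:

```python
import time
import torch

def measure_latency(model, inputs, n_warmup=10, n_runs=100):
    """Rough average GPU inference latency; `model`/`inputs` are placeholders."""
    model.eval()
    with torch.no_grad():
        for _ in range(n_warmup):       # warm up kernels and cuDNN autotuning
            model(*inputs)
        torch.cuda.synchronize()        # drain pending kernels before timing
        start = time.perf_counter()
        for _ in range(n_runs):
            model(*inputs)
        torch.cuda.synchronize()        # wait for all queued runs to finish
    return (time.perf_counter() - start) / n_runs
```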

Definitely, yes! But you may need a text-to-phoneme converter such as [Phonemizer](https://github.com/bootphon/phonemizer) to convert Chinese text into phonemes. This model takes phonemes as input rather than characters.
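
For example, a minimal sketch with Phonemizer's espeak backend (assuming espeak-ng is installed; 'cmn' is its code for Mandarin, so adjust to your setup):

```python
from phonemizer import phonemize

text = "你好，世界"
# Convert raw Chinese text into a phoneme string; the result still has to be
# mapped to the model's symbol IDs before inference.
phonemes = phonemize(text, language='cmn', backend='espeak', strip=True)
print(phonemes)
```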

@LG-SS Now the paper is available: https://arxiv.org/abs/2106.06103

> > Definitely, yes! But you may need a text-to-phoneme converter such as [Phonemizer](https://github.com/bootphon/phonemizer) to convert Chinese text into phonemes.
> > This...

> @jaywalnut310 Hi, may I ask one last question: how does the latency compare with Tacotron 2 (I mean end-to-end latency; Tacotron 2 may also need a vocoder, which counts in), is vits...

> @jaywalnut310 Is this model autoregressive or non-autoregressive?

Hi @leminhnguyen, this model is non-autoregressive.

@leminhnguyen Well, VITS provides controllability to some extent. You can control and change the duration manually, and you can control energy and pitch by manipulating the latent representation...
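
As a rough sketch (assuming the `SynthesizerTrn.infer` interface used in this repo's inference notebook; the exact scale values here are just illustrative):

```python
import torch

with torch.no_grad():
    audio = net_g.infer(
        x_tst, x_tst_lengths,
        noise_scale=0.667,   # variance of the latent z: varies prosody, pitch, energy
        noise_scale_w=0.8,   # variance of the stochastic duration predictor
        length_scale=1.2,    # >1 slows speech down, <1 speeds it up (duration control)
    )[0][0, 0].cpu().numpy()
```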

Though my English is poor, I'll answer in English for other people. Yes, line 127 of train.py doesn't consider the number of GPUs, which may cause misunderstanding about training...
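
A hypothetical illustration of the pitfall (not the repo's actual code; `train_loader` and `batch_size_per_gpu` are placeholders): under DistributedDataParallel each process' sampler only sees its shard of the data, so counts derived from the local loader must be scaled by the number of GPUs.

```python
import torch.distributed as dist

world_size = dist.get_world_size() if dist.is_initialized() else 1
steps_per_epoch_local = len(train_loader)            # what a single GPU iterates over
samples_per_epoch_total = len(train_loader.dataset)  # the full dataset across all GPUs
effective_batch_size = batch_size_per_gpu * world_size
```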

Sorry for the dense calculation of the MLE loss... I'll let you know when I clean up the clutter in the code. For now, I'll explain the loss term by term....
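
As a stopgap, here is a lightly annotated sketch of the Gaussian MLE loss in Glow-TTS-style code (the negative log-likelihood of z under N(m, exp(logs)²), plus the flow's change-of-variables term); the variable names follow that convention and may differ slightly from the code here:

```python
import math
import torch

def mle_loss(z, m, logs, logdet, mask):
    # Gaussian NLL without the constant: sum(log sigma) + 0.5 * (z - mu)^2 / sigma^2
    l = torch.sum(logs) + 0.5 * torch.sum(torch.exp(-2 * logs) * (z - m) ** 2)
    l = l - torch.sum(logdet)                     # log-determinant from the flow
    l = l / torch.sum(torch.ones_like(z) * mask)  # average over valid (unmasked) elements
    l = l + 0.5 * math.log(2 * math.pi)           # constant term: kept only so the value
    return l                                      # is the exact NLL; it has no gradient
```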

Yes, the constant term is ignored in backpropagation; I just left it in for the exact calculation of the log-likelihood. And I saw AlignTTS, which also proposes an alignment search algorithm similar...

So your situation is: 1) you have your own multi-speaker dataset, and its total duration is only one hour, and 2) you trained the model with the LJ...
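
In that case, a hypothetical warm-start sketch could look like the following (the checkpoint path is illustrative; `net_g` is a freshly built multi-speaker model, and weights whose shapes don't match, e.g. a new speaker embedding, are simply skipped):

```python
import torch

ckpt = torch.load("pretrained_ljs.pth", map_location="cpu")
pretrained = ckpt["model"] if "model" in ckpt else ckpt
own_state = net_g.state_dict()
# Keep only the tensors that exist in the new model with identical shapes.
filtered = {k: v for k, v in pretrained.items()
            if k in own_state and v.shape == own_state[k].shape}
own_state.update(filtered)
net_g.load_state_dict(own_state)
```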