
Has anyone reproduced the sound quality from the demo page?

Open WeiLi233 opened this issue 5 years ago • 15 comments

WeiLi233 avatar Aug 19 '19 01:08 WeiLi233

Could any researchers share advice or experience with the reproduction procedure? I find it really difficult to reproduce the sound quality shown on the demo page, and I am unsure which part of my pipeline is at fault.

WeiLi233 avatar Aug 19 '19 01:08 WeiLi233

Even though I have spent a lot of time on it, I cannot reproduce it either, although I used the VoxCeleb2 dataset (in-the-wild data with around 6,000 speakers) rather than VCTK. The network learns the auto-encoding task well but does not generalize to voice conversion. Any comments?

iyah4888 avatar Sep 01 '19 18:09 iyah4888

I tried several times, but the voice quality got stuck at a disappointing level, whether on the VCTK dataset or my own Chinese speech dataset.

xuexidi avatar Oct 01 '20 03:10 xuexidi

I was able to produce audio consisting of 'ghostly' voices after 100k iterations, but with a lot of noise. Have either of you @WeiLi233 @xuexidi been able to train models that perform similarly to the pretrained model provided? I am concerned I have been doing something wrong, as my model produces audio with a lot of noise that is very poor in comparison to the audio examples provided with the paper.

Trebolium avatar Dec 12 '20 19:12 Trebolium

First of all, I want to thank the author for making the code public. The code is neat and easy to read.

Sadly, though, the model doesn't perform as well as I expected after listening to the public demo.

I used the public code and the pre-trained model to run inference on some of the waveform files provided by the author in the wavs directory, but the model cannot produce satisfying results even on these samples. Although the mel-spectrogram looks good, if you listen to the output audio you can't even understand the content. I also tried unseen speakers by recording my own voice with my computer's microphone, and that doesn't work either; the model's output sounds more like noise than a voice. I guess the model is not robust enough, since my recorded audio's spectrogram looks different from the VCTK data because a different recording device was used.

JiachuanDENG avatar Dec 23 '20 03:12 JiachuanDENG

The pre-trained model is for demonstration purposes only. The model should perform well after careful re-training. As far as I know, someone has made a voice conversion phone app for Mandarin Chinese based on this model.

auspicious3000 avatar Dec 23 '20 03:12 auspicious3000

I am using the AISHELL-3 Mandarin corpus to train the VC model. For preprocessing, the speaker embedder uses the pretrained 3000000-BL.ckpt. I ran main.py for 1,000,000 iterations (although there seemed to be no loss decrease after 500,000 iterations), but the resulting model is not good: nothing intelligible can be heard except noise. Are there any suggestions for training a good Mandarin model?

Below are the training logs and one of the final converted waveforms from my tests.

[Images: training losses at 500k (50W) and 1M (100W) iterations, and the converted waveform (result_wav)]

JohnHerry avatar Jan 27 '21 11:01 JohnHerry
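For reference, extracting the speaker embeddings during preprocessing typically follows the pattern in the repo's make_metadata.py. The sketch below assumes that script's checkpoint layout (weights under the 'model_b' key with a 'module.' prefix) and uses illustrative crop settings; treat the details as assumptions to verify against your copy of the code:

```python
# Sketch of speaker-embedding extraction with the pretrained 3000000-BL.ckpt,
# following the pattern in make_metadata.py. Checkpoint key names and crop
# settings are assumptions based on that script.
import numpy as np
import torch
from model_bl import D_VECTOR

encoder = D_VECTOR(dim_input=80, dim_cell=768, dim_emb=256).eval().cuda()
ckpt = torch.load('3000000-BL.ckpt')
# Strip the 'module.' prefix left over from DataParallel training.
encoder.load_state_dict({k[7:]: v for k, v in ckpt['model_b'].items()})

def speaker_embedding(mel, len_crop=128, n_crops=10):
    """Average the d-vector over random crops of a (T, 80) mel-spectrogram.

    Assumes the utterance is longer than len_crop frames.
    """
    embs = []
    for _ in range(n_crops):
        left = np.random.randint(0, mel.shape[0] - len_crop)
        crop = torch.from_numpy(mel[np.newaxis, left:left + len_crop, :]).float().cuda()
        with torch.no_grad():
            embs.append(encoder(crop).squeeze(0).cpu().numpy())
    return np.mean(embs, axis=0)
```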

Check that the tensors are the same shape before computing their loss?


Trebolium avatar Feb 01 '21 13:02 Trebolium

(Quoting @JohnHerry's question above about training on AISHELL-3.)

@JohnHerry The loss function's input tensors have mismatched shapes; I think it is a bug in the source code. The mismatch causes the G/loss_cd term to converge to 0.0001 very quickly, which actually means the model learns nothing at all and cannot synthesize any speech. I tried modifying the loss function's input tensors so that their shapes match: when retraining, every loss component decreased at a normal rate, and the model eventually produced human-sounding voices. The audio quality was still not great, though, probably because I did not retrain the WaveNet vocoder.

xuexidi avatar Feb 01 '21 13:02 xuexidi
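For anyone hitting the same issue, here is a minimal sketch of the kind of shape fix described above, assuming the generator output carries an extra singleton dimension (B, 1, T, n_mel) while the target is (B, T, n_mel). With mismatched shapes, F.mse_loss broadcasts silently and yields a misleading loss value:

```python
# Minimal sketch of the shape check/fix discussed above. Assumes the
# generator output has shape (B, 1, T, n_mel) while the target is
# (B, T, n_mel); other shape mismatches would need their own handling.
import torch
import torch.nn.functional as F

def reconstruction_loss(x_real, x_identic):
    if x_identic.dim() == x_real.dim() + 1:
        x_identic = x_identic.squeeze(1)  # drop the extra singleton dimension
    assert x_identic.shape == x_real.shape, \
        f"shape mismatch: {x_identic.shape} vs {x_real.shape}"
    return F.mse_loss(x_identic, x_real)
```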

(Quoting @Trebolium above: "Check that the tensors are the same shape before computing their loss?")

I think you are right; I realized this bug a few months ago.

xuexidi avatar Feb 01 '21 13:02 xuexidi

(Quoting @xuexidi's loss-function shape fix above.)

It works, thank you very much!

JohnHerry avatar Mar 03 '21 06:03 JohnHerry

(Quoting @xuexidi's loss-function shape fix above.)

Hello, may I ask how low the final loss_id gets in your training? For the model I trained, loss_id plateaus at around 0.001 and will not go any lower.

kingofview avatar Mar 17 '21 03:03 kingofview

I think my loss_cd went down to 0.002/0.001, but the loss_id wouldn't get that low, because we are inferring through a bottleneck after all, meaning the reconstruction of the spectrogram will never be perfect. The question is how good your reconstructed mel-spectrograms sound.

Trebolium avatar Mar 17 '21 16:03 Trebolium
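Since "how good does it sound" is hard to compare across setups, a complementary sanity check is to plot the ground-truth and reconstructed mel-spectrograms side by side. Below is a minimal sketch, assuming both are (T, n_mel) numpy arrays of the same shape and scale; mel_gt and mel_rec are placeholder names:

```python
# Side-by-side view of ground-truth vs. reconstructed mel-spectrograms,
# a quick visual check to go along with listening tests.
import matplotlib.pyplot as plt
import numpy as np

def compare_mels(mel_gt, mel_rec, title='reconstruction check'):
    # Two stacked panels: ground truth on top, reconstruction below.
    fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
    for ax, mel, name in zip(axes, (mel_gt, mel_rec), ('ground truth', 'reconstructed')):
        ax.imshow(mel.T, origin='lower', aspect='auto', interpolation='none')
        ax.set_ylabel(name)
    # Report the element-wise MSE alongside the plots as a rough numeric check.
    axes[0].set_title('%s (MSE: %.4f)' % (title, np.mean((mel_gt - mel_rec) ** 2)))
    plt.show()
```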

(Quoting @Trebolium's comment above about loss_cd and the bottleneck.)

We tested the speaker similarity between the ground-truth audio y and the generated audio y' from this model, using Cosine(SpeakerEmbedding(y), SpeakerEmbedding(y')), where the speaker embedding model is a third-party pretrained model. The resulting values for this AutoVC model are between 0.3 and 0.6, better for seen speakers and poor for unseen ones. By contrast, the values from the ESPnet seq2seq VC model are between 0.93 and 0.95.

So the generated audio from this AutoVC model is not good enough in either naturalness or similarity. On the other hand, the AutoVC model is very simple and fast, and it supports zero-shot conversion for unseen speakers.

JohnHerry avatar Mar 19 '21 08:03 JohnHerry
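For anyone wanting to replicate this evaluation, the metric itself is straightforward. In the sketch below, embed() is a hypothetical stand-in for the third-party pretrained speaker embedding model:

```python
# Sketch of the speaker-similarity metric described above: cosine similarity
# between speaker embeddings of ground-truth and converted audio.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors; 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage, where embed() is the pretrained speaker embedding model:
#   similarity = cosine_similarity(embed(y), embed(y_converted))
# Values near 1.0 suggest the converted audio preserves the target speaker identity.
```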

(Quoting @auspicious3000's comment above about the Mandarin voice conversion phone app.)

Could you kindly tell us which team or GitHub repo made the mobile voice conversion app? Thank you.

dragen1860 avatar Oct 27 '21 06:10 dragen1860