Using with Tacotron2

Open ArnaudWald opened this issue 6 years ago • 80 comments

Hello,

I would like to connect a Tacotron2 model to LPCNet. Is there a way to convert the 80 mel coefficients (the output of Tacotron2) into the 18 Bark-scale coefficients plus 2 pitch parameters (the input of LPCNet)?

On a related note, descriptions of the Bark scale, like the one on Wikipedia, usually list 24 bands, and I don't understand why only 18 are computed here. Even taking the 16 kHz sampling rate into account, that would leave 22 of them, right?
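For reference, a small sketch of where the two counts may come from. The first list holds the standard Zwicker critical-band edges. The second is LPCNet's band layout as recalled from `src/freq.c` (the `eband5ms` table, in units of 200 Hz); treat it as an assumption and check it against the source. If it is right, the 18 bands are a coarser, Bark-like layout rather than a truncated copy of the 24-band table.

```python
# Zwicker critical-band (Bark) edges in Hz: 25 edges define 24 bands.
BARK_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500]

# ASSUMPTION: band edges recalled from LPCNet's src/freq.c (eband5ms),
# in units of 200 Hz. Verify against the source before relying on them.
LPCNET_BAND_EDGES = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16,
                     20, 24, 28, 34, 40]


def bark_bands_below(nyquist_hz):
    """Count Bark bands whose lower edge lies below the Nyquist frequency."""
    return sum(1 for low in BARK_EDGES_HZ[:-1] if low < nyquist_hz)


def lpcnet_band_edges_hz(step_hz=200):
    """Convert the assumed LPCNet band table to frequencies in Hz."""
    return [edge * step_hz for edge in LPCNET_BAND_EDGES]


if __name__ == "__main__":
    print("Bark bands in total:", len(BARK_EDGES_HZ) - 1)
    print("Bark bands starting below 8 kHz:", bark_bands_below(8000))
    print("Assumed LPCNet band edges (Hz):", lpcnet_band_edges_hz())
```

With a 16 kHz sampling rate the Nyquist frequency is 8 kHz, which is why 22 of the 24 Bark bands remain; the assumed LPCNet table simply uses 18 wider bands over the same 0 to 8 kHz range.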

Thanks a lot :)

ArnaudWald avatar Apr 23 '19 08:04 ArnaudWald

I am using Tacotron2 to predict the 20-dimensional features for LPCNet, but there is noise in the synthesized audio.

superhg2012 avatar Jun 04 '19 10:06 superhg2012

I am using Tacotron2 to predict the 20-dimensional features for LPCNet, but there is noise in the synthesized audio.

Is there any way to improve the sound quality?

lyz04551 avatar Aug 23 '19 01:08 lyz04551

@superhg2012 I get the same problem, did you solve it?

byuns9334 avatar Sep 21 '19 02:09 byuns9334

I've tried with the current master of tacotron2 and with LPCTron, but both failed.

With an adaptation of my fork using the correct hparams, I'm generating high-quality speech: audios.zip

I used my fork (spanish branch) plus MlWoo's adaptation of LPCNet. You need to change your paths and symbols; see the commit history: https://github.com/carlfm01/Tacotron-2/tree/spanish

carlfm01 avatar Sep 21 '19 03:09 carlfm01

@carlfm01 In your fork, could you tell me how to generate a wav from the f32 features? And is it as fast as the original LPCNet?

byuns9334 avatar Sep 22 '19 02:09 byuns9334

How to generate a wav from the f32 features? And is it as fast as the original LPCNet?

The Tacotron repo predicts the features, not the wav. To generate the wav from the features predicted by Tacotron, you need to use the https://github.com/mlwoo/LPCNet fork.

For me, with a sparsity of 200 and AVX enabled, it is 3x faster than real time.

carlfm01 avatar Sep 22 '19 02:09 carlfm01

@carlfm01 I already tried the https://github.com/mlwoo/LPCNet fork, but the wav it generates has too much noise, as I described in https://github.com/MlWoo/LPCNet/issues/6. How did you solve this problem? Any suggestions, please?

byuns9334 avatar Sep 22 '19 03:09 byuns9334

Is the noise there with the features predicted by Tacotron, or with the real features?

carlfm01 avatar Sep 22 '19 03:09 carlfm01

@carlfm01 Using the real features. I converted the real wav -> (by ./dump_data) s16 -> (./test_lpcnet) f32 -> (by ffmpeg) wav, as explained in MlWoo's repo. This is supposed to convert the f32 back to the original wav, but the noise is severe (the original voice is still audible, though). Have you experienced this? When you used MlWoo's fork, were speed and audio quality both perfect? If so, what did you modify in MlWoo's code? Thank you so much for your help.
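For the last step of a pipeline like this, a minimal Python sketch that wraps raw PCM in a WAV container without ffmpeg. It assumes the synthesis binary writes headerless 16-bit little-endian mono PCM at 16 kHz, which is not confirmed in this thread; the function name `pcm_s16le_to_wav` and the file names are hypothetical.

```python
import wave


def pcm_s16le_to_wav(pcm_path, wav_path, sample_rate=16000, channels=1):
    """Wrap headerless 16-bit little-endian PCM in a WAV container.

    Returns the duration in seconds so it can be compared with the input.
    """
    with open(pcm_path, "rb") as f:
        pcm = f.read()

    bytes_per_frame = 2 * channels  # 2 bytes per 16-bit sample
    if len(pcm) % bytes_per_frame:
        raise ValueError("PCM size is not a whole number of 16-bit frames")

    with wave.open(wav_path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(pcm)

    return len(pcm) / bytes_per_frame / sample_rate


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python pcm_to_wav.py out.pcm out.wav
    seconds = pcm_s16le_to_wav(sys.argv[1], sys.argv[2])
    print(f"wrote {sys.argv[2]} ({seconds:.2f} s)")
```

Comparing the returned duration with the length of the source recording is a cheap first check before listening for noise.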

byuns9334 avatar Sep 22 '19 03:09 byuns9334

Were speed and audio quality both perfect?

Yes.

What did you modify from MlWoo's code?

Nothing.

My only guess is that you may have made a mistake when compiling your exported weights.

https://github.com/mozilla/LPCNet/issues/58#issuecomment-533470433

carlfm01 avatar Sep 22 '19 03:09 carlfm01

Using MlWoo's fork: feature.zip

carlfm01 avatar Sep 22 '19 03:09 carlfm01

@carlfm01 Thanks. Let me explain in detail what I have done so far.

I now have two repositories: LPCNet (the original LPCNet repo) and LPCNet_MlWoo.

I trained LPCNet and got the nnet_data_* files in the LPCNet/src directory. I moved all of them to LPCNet_MlWoo/src, because running './dump_lpcnet.py lpcnet15_384_10_G16_64.h5' in the LPCNet_MlWoo repo failed with a model shape error. (The lpcnet15_384_10_G16_64.h5 model was generated in the original LPCNet repo.)

Then I just ran 'make dump_data taco=1' and 'make test_lpcnet taco=1'.

Does this make sense? (I didn't change any parameters in LPCNet or LPCNet_MlWoo.)

byuns9334 avatar Sep 22 '19 03:09 byuns9334

The model was generated in the original LPCNet repo.

That's the issue. I'm afraid you need to retrain using MlWoo's fork; I did not train with LPCNet (this repo).

carlfm01 avatar Sep 22 '19 03:09 carlfm01

@carlfm01 But is there any difference between the LPCNet training code in MlWoo's fork and in the original repo? Aren't they exactly the same?

byuns9334 avatar Sep 22 '19 03:09 byuns9334

@carlfm01 So you did everything (training LPCNet, running inference, and so on) in MlWoo's repo, right? Which hyperparameters or options did you change?

byuns9334 avatar Sep 22 '19 03:09 byuns9334

But is there any difference between the LPCNet training code in MlWoo's fork and in the original repo? Aren't they exactly the same?

No; otherwise you would be able to load models from both. I also tried, and it threw an error about a missing or an extra layer, I can't recall which. The inference code is also different.

So you did everything (training LPCNet, running inference, and so on) in MlWoo's repo, right?

Yes, with the defaults.

The only thing I changed was the training code, so that it loads checkpoints and adapts on new data.

This is missing from LPCNet_MlWoo:

https://github.com/mozilla/LPCNet/blob/master/src/train_lpcnet.py#L106-L125

carlfm01 avatar Sep 22 '19 03:09 carlfm01

@carlfm01 Okay, thank you so much. I will try. And you are suggesting that when combining Tacotron2 and LPCNet, I should use your spanish fork for Tacotron2, right?

byuns9334 avatar Sep 22 '19 03:09 byuns9334

@carlfm01 Okay, thank you so much. I will try. And you are suggesting that when combining Tacotron2 and LPCNet, I should use your spanish fork for Tacotron2, right?

Yes, just change your paths and symbols; see the commit history to understand it better. I've tried LPCTron and the Tacotron master, but both failed, generating noisy speech.

carlfm01 avatar Sep 22 '19 03:09 carlfm01

@carlfm01 Thank you so much 🙏. Wish you all the best. I will message you again if I have other questions.

byuns9334 avatar Sep 22 '19 04:09 byuns9334

And share your results! 👍

carlfm01 avatar Sep 22 '19 04:09 carlfm01

@carlfm01 Hi, I followed all your instructions (retraining from MlWoo's repo) and have now trained 6 epochs as a test. The original wav is about 3 seconds long, but the generated audio is about 8 seconds long. Have you experienced this problem?

byuns9334 avatar Sep 23 '19 01:09 byuns9334

Hello. No, I'm getting the same duration. Is this with real features?

carlfm01 avatar Sep 23 '19 01:09 carlfm01

@carlfm01 Yes, real features. I also ran './test_lpcnet ~.h5' successfully. This issue is strange... I'll look into it further. Thanks!

byuns9334 avatar Sep 23 '19 01:09 byuns9334

@carlfm01 Are the sample rate, precision, and sample encoding of your training wav files 16000 Hz, 16-bit, and 16-bit signed integer PCM?

byuns9334 avatar Sep 23 '19 01:09 byuns9334

Please make sure you use make test_lpcnet taco=1 if you extracted the features with taco enabled in ./dump_data, or disable taco for both.

Yes, 16000, 16bit, mono
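A quick way to confirm that a training file matches this format is Python's standard `wave` module. This is only a sketch; the helper name `check_training_wav` is hypothetical, and the defaults mirror the 16000 Hz, 16-bit, mono format mentioned above.

```python
import wave


def check_training_wav(path, rate=16000, sample_width=2, channels=1):
    """Return a list of mismatches against the expected PCM format.

    An empty list means the file is 16 kHz, 16-bit, mono (by default).
    """
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != rate:
            problems.append(f"sample rate is {w.getframerate()}, expected {rate}")
        if w.getsampwidth() != sample_width:
            problems.append(
                f"sample width is {w.getsampwidth()} bytes, expected {sample_width}"
            )
        if w.getnchannels() != channels:
            problems.append(f"{w.getnchannels()} channels, expected {channels}")
    return problems


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_wav.py a.wav b.wav ...
    for wav_path in sys.argv[1:]:
        issues = check_training_wav(wav_path)
        print(wav_path, "OK" if not issues else "; ".join(issues))
```

Note that the `wave` module only opens uncompressed PCM files, so a file it refuses to read is itself a sign that the encoding is not what is expected.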

carlfm01 avatar Sep 23 '19 01:09 carlfm01

@carlfm01 I just ran both 'make dump_data taco=1' and 'make test_lpcnet taco=1', so they are both up-to-date.

byuns9334 avatar Sep 23 '19 01:09 byuns9334

What about the quality? Do you get the same result after cleaning and testing without taco? Please also make sure you run make clean.

carlfm01 avatar Sep 23 '19 01:09 carlfm01

@carlfm01 If I want to do them without taco, should I run 'make dump_data' and 'make test_lpcnet' instead of 'make dump_data taco=1' and 'make test_lpcnet taco=1'?

And yes, I think I ran make clean.

byuns9334 avatar Sep 23 '19 01:09 byuns9334

If I want to do them without taco, should I run 'make dump_data' and 'make test_lpcnet' instead of 'make dump_data taco=1' and 'make test_lpcnet taco=1'?

Yes.

carlfm01 avatar Sep 23 '19 01:09 carlfm01

@carlfm01 It works now. Incredible. The problem was that I didn't run make clean at the very first step. The generated audio samples are extremely clean, and inference is much faster than real time. I will upload test results here in a few minutes. The only suspicious thing is that this works perfectly even with only 6 epochs of training... Thank you so much.
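A possible explanation for the earlier 3-second versus 8-second mismatch, sketched under assumptions: if dump_data writes 55 floats per frame in the default build and 20 with taco=1 (both figures are assumptions about the two build modes, not confirmed in this thread), a stale binary that reads a 55-float file as 20-float frames would stretch 3 s of audio to about 8.25 s. The 10 ms frame length corresponds to 160 samples at 16 kHz. The helper name `expected_seconds` is hypothetical.

```python
import os

FRAME_SECONDS = 0.010  # one LPCNet frame: 160 samples at 16 kHz

# ASSUMPTION: float32 values written per frame in each build mode.
FLOATS_PER_FRAME = {"default": 55, "taco": 20}


def expected_seconds(feature_path, mode):
    """Estimate the audio duration implied by a raw float32 feature file."""
    n_floats = os.path.getsize(feature_path) // 4  # 4 bytes per float32
    n_frames = n_floats / FLOATS_PER_FRAME[mode]
    return n_frames * FRAME_SECONDS


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python feature_duration.py features.f32
    for mode in FLOATS_PER_FRAME:
        print(mode, round(expected_seconds(sys.argv[1], mode), 2), "s")
```

If only one of the two estimates matches the length of the source recording, that tells you which build mode produced the feature file, and therefore how the synthesis binary needs to be built after make clean.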

byuns9334 avatar Sep 23 '19 02:09 byuns9334