
Unclear speech

Open dhm42 opened this issue 8 years ago • 5 comments

I am trying to train Merlin with a new voice but the results are still not clear.

  • I am using phone alignment (I checked that the alignment is accurate)

  • I used the WORLD vocoder to extract acoustic features, and the regenerated audios are good (see attached file).

  • I am using a very simple question file, adapted with the phonemes used in the labels files.

  • For now I am using between 50 and 300 sentences, but the results don't get better even when I increase the number of sentences.

  • My configuration is:

        [Architecture]
        hidden_layer_size   : [1024, 1024, 1024, 1024, 1024, 1024]
        hidden_layer_type   : ['TANH', 'TANH', 'TANH', 'TANH', 'TANH', 'TANH']
        # if RNN or sequential training is used, please set sequential_training to True.
        sequential_training : False
        dropout_rate        : 0.0
        learning_rate       : 0.002
        batch_size          : 256
        output_activation   : linear
        warmup_epoch        : 10
        warmup_momentum     : 0.3
        training_epochs     : 25

The NN is not stopping early. The errors for the acoustic model are:

    Develop: DNN -- MCD: 6.596 dB; BAP: 0.197 dB; F0 RMSE: 13.796 Hz; CORR: 0.186; VUV: 19.690%
    Test:    DNN -- MCD: 6.790 dB; BAP: 0.202 dB; F0 RMSE: 14.571 Hz; CORR: 0.125; VUV: 24.537%
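For reference, a minimal sketch of how an MCD value like the ones above can be computed from two already-aligned mel-cepstral feature matrices. This is an assumption about the standard metric, not Merlin's exact evaluation code; the function name is mine, and it assumes the 0th (energy) coefficient is excluded:

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, gen_mcep):
    # Both inputs are (frames x order) matrices of mel-cepstral coefficients,
    # time-aligned frame by frame. The 0th coefficient (energy) is skipped.
    diff = ref_mcep[:, 1:] - gen_mcep[:, 1:]
    # Standard MCD formula: (10 / ln 10) * sqrt(2 * sum of squared differences),
    # averaged over frames, giving a value in dB.
    return (10.0 / np.log(10.0)) * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
```

A higher MCD means the generated spectra are further from the reference; well-trained voices on clean data typically land well below the ~6.6 dB reported here.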

The audio file I get is attached; please change the extension from .pdf to .zip. I always compare with the arctic_demo results I got with 50 sentences, which are much better. This is the third database I have used, and I always get similar results: plain fuzz with intonation. Does anyone have ideas of parameters I could change to make the voice clearer? Does it seem like an F0 extraction problem? Thank you in advance. results.pdf

dhm42 avatar Apr 12 '17 17:04 dhm42

Can you upload your question file and a sample label_phone_align lab file with the corresponding wav file? My main worry is that you are using the Festival scripts to generate your label files; those are for English, but you are trying to synthesize French. The pronunciations of your words will be English, so I don't see how you can get a good list of phonemes, or an accurate alignment of them to the input audio, when they don't correspond at all with what was actually said. Then, when you try to train a DNN, each phoneme model will contain such diverse acoustic information that the network doesn't know what to do with it; it will mix voiced and unvoiced frames. Again, this is just my guess from listening to the result.

Do you have any experience with (French) text-to-speech? You need French linguistic preprocessing first to process your input text to get the correct phonemes, syllable stress, etc.

dreamk73 avatar Apr 13 '17 07:04 dreamk73

@dreamk73 thank you for your answer. You can find the question file, a wav file and the corresponding label_phone_align lab file attached. I didn't use Festival to generate or align the labels; I used a tool trained for French HMM-based TTS to do it.

PS: The question file is very basic. I actually did the same with the arctic_demo question file (kept the central phonemes only) and the result is still good for 50 sentences. With the complete French question file, the result gets worse. examples_zip.pdf

dhm42 avatar Apr 13 '17 08:04 dhm42

Thanks for clarifying. The label file and wav file indeed look ok. I would have expected at least clear-sounding phonemes as well. I would probably dig into the label normalization script and make sure that the correct phoneme labels are being normalized. And maybe WORLD is still doing something strange when extracting acoustic features. Why else would the model not converge and the VUV error be so high?
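One way to dig into the question file is to check which questions actually fire for a given full-context label. A minimal sketch, assuming the questions are plain HTS-style wildcard patterns (`*`, `?`) as in standard question files; the function name is mine:

```python
import re

def question_matches(patterns, label):
    # An HTS-style binary question is a list of wildcard patterns such as
    # '*-a+*'; the question answers 1 for a label if ANY pattern matches.
    for p in patterns:
        # Translate the wildcard pattern into an anchored regex:
        # escape everything, then turn '*' into '.*' and '?' into '.'.
        regex = '^' + re.escape(p).replace(r'\*', '.*').replace(r'\?', '.') + '$'
        if re.match(regex, label):
            return True
    return False
```

Running each question against a handful of labels from label_phone_align and counting how many never match (or always match) is a quick sanity check that the question set and the label phoneme inventory agree.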

dreamk73 avatar Apr 13 '17 09:04 dreamk73

Edited: I checked the input vectors for the duration model and they seem ok:

    [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
      0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
      0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    ...

But for the acoustic part, I noticed that the vector loaded from the acoustic_model/data/label_phone_align/ or test_synthesis/gen-lab/ files is strange. It looks like this:

    [ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
      0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
      0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
      0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
      0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
      0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
      0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
      0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00 9.97336156e-01
      4.59007154e-01 4.40498668e-02 2.64000000e+02]

Only the last 4 numbers are changing, and the number in bold is sometimes 1 and sometimes 0. I don't know if that's normal; I only know that a similar vector is also loaded for the demo_arctic data. The model converges after the 29th epoch but the result is not better. As for the VUV value, I don't know why it is so high.
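To check systematically which input dimensions actually carry information, one option is to load the binary feature file and report the columns that are not constant. This is a hedged sketch: it assumes the file is a flat float32 dump laid out as frames x dim (Merlin's usual binary layout), and the function name and `dim` argument are mine:

```python
import numpy as np

def varying_dims(path, dim):
    # Read a flat float32 binary feature file and reshape to (frames x dim).
    data = np.fromfile(path, dtype=np.float32).reshape(-1, dim)
    # A dimension is "dead" if min == max across all frames; return the rest.
    return np.where(data.min(axis=0) != data.max(axis=0))[0]
```

If only a handful of dimensions vary across an entire utterance (as in the vector above, where everything but the last few positions is zero), the linguistic context is effectively not reaching the acoustic model, which would explain the fuzzy output.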

dhm42 avatar Apr 13 '17 13:04 dhm42

How did you compute the objective values? Regards, Safi

arsafi avatar Dec 16 '18 04:12 arsafi