
A great improvement has been made for the master branch (LJSpeech)

Open begeekmyfriend opened this issue 6 years ago • 27 comments

Hi all, I have good news: we may not need to worry about slow alignment or a huge dataset for this project anymore, because I have found a solution. Several days ago I was still discussing how many wav clips, or how many hours of audio, are needed for training. But now I have published the fix on my fork here. I used the default LJSpeech corpus for training and, as you can see, the alignment was learned quickly, within 7K steps, on a dataset of less than 20 hours (fewer than 10K clips). step-7000-align

The major modification is the location-sensitive attention model inspired by @Rayhane-mamah, at commit https://github.com/begeekmyfriend/tacotron/commit/b02eee7b86019dfefb1ba36a4e193b186307daa6. I have abandoned the original AttentionWrapper API, and the audio quality still holds up with the Griffin-Lim (G&L) vocoder. By the way, AttentionWrapper really sucks; we'd better not use it.
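Roughly, location-sensitive attention extends Bahdanau attention with convolutional features computed from the cumulative alignments of previous decoder steps, so the model knows where it has already attended. Here is a minimal NumPy sketch of just the energy computation (illustrative only; the names are made up and this is not the code in the commit above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def location_sensitive_alignments(query, keys, cum_alignments,
                                  W, V, U, loc_filters, v_a):
    # Location features: 1-D convolutions over the cumulative alignments
    # of all previous decoder steps, shape (T_in,) -> (T_in, n_filters).
    f = np.stack([np.convolve(cum_alignments, k, mode='same')
                  for k in loc_filters], axis=-1)
    # Energies: e_j = v_a^T tanh(W q + V k_j + U f_j)
    energies = np.tanh(query @ W + keys @ V + f @ U) @ v_a
    # Normalized attention weights over the encoder timesteps.
    return softmax(energies)
```

After each decoder step the caller accumulates `cum_alignments += alignments`; those cumulative weights are what the location features are computed from.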

The Chinese Mandarin version is also available on this branch. Welcome to give it a try.

begeekmyfriend avatar Aug 17 '18 03:08 begeekmyfriend

Hi, thank you for your work. I am very impressed by your fork as well. Could you elaborate a bit on what the alignment curve looks like in the early training phases of the Mandarin version? I am trying to replicate a good alignment result on the THCHS30 data. Thank you!

jiamingkong avatar Aug 26 '18 08:08 jiamingkong

The alignment should appear within 10K steps unless your dataset is abnormal.

begeekmyfriend avatar Aug 26 '18 09:08 begeekmyfriend

step-10000-align

Oh wow, it did show up!

jiamingkong avatar Aug 26 '18 11:08 jiamingkong

You can even reduce your dataset to 1~5 hours; just try it.

begeekmyfriend avatar Aug 26 '18 11:08 begeekmyfriend

Nice work!

keithito avatar Aug 27 '18 20:08 keithito

@keithito The modification is large but effective. There are two major aspects. First, I replaced the AttentionWrapper interface. Second, I added a stop token target during training so the model learns when to stop decoding. If I send a PR, would you be willing to accept it?
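For the second point, the idea is to add a sigmoid "stop" projection on the decoder output and train it with binary cross-entropy against targets that flip to 1 at the last valid frame. A schematic sketch of the target construction (the helper name is illustrative, not the actual code in my fork):

```python
import numpy as np

def make_stop_token_targets(mel_lengths, max_frames):
    """Targets are 0 while decoding should continue and 1 from the last
    valid frame onward; at inference time decoding halts once the stop
    prediction crosses a threshold."""
    targets = np.zeros((len(mel_lengths), max_frames), dtype=np.float32)
    for i, n in enumerate(mel_lengths):
        targets[i, n - 1:] = 1.0  # final frame and all padding are "stop"
    return targets
```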

begeekmyfriend avatar Aug 28 '18 02:08 begeekmyfriend

@begeekmyfriend Yes, if you send over a PR, I would be happy to review and merge it.

keithito avatar Aug 28 '18 04:08 keithito

Hello @begeekmyfriend, does the Chinese Mandarin version fit the thchs30 dataset?

peter05010402 avatar Sep 07 '18 07:09 peter05010402

@peter05010402 Of course it does.

begeekmyfriend avatar Sep 07 '18 08:09 begeekmyfriend

@begeekmyfriend Thank you for the reply. With a single speaker's wav and text files (the quantity is 2 hours), I trained for 30K steps, but there is no alignment. Could you help?

step-30000-align

peter05010402 avatar Sep 08 '18 05:09 peter05010402

The plot does not look like output from my fork. Did you use https://github.com/begeekmyfriend/tacotron/tree/mandarin ?

begeekmyfriend avatar Sep 08 '18 13:09 begeekmyfriend

@begeekmyfriend thank you for your reply! Do you mean I should clone commit f8de0d7 ?

peter05010402 avatar Sep 10 '18 02:09 peter05010402

The original repo of Keith Ito is only for English. I guess what you want to train is Chinese Mandarin, isn't it?

begeekmyfriend avatar Sep 10 '18 03:09 begeekmyfriend

@begeekmyfriend Thank you for your work. It aligned at 7K steps on thchs30! step-7000-align

But it does not align at 7K steps on one person's speech from thchs30 (the quantity is 2.15 hours). step-7000-align

Could you give some suggestions to improve it?

peter05010402 avatar Sep 10 '18 07:09 peter05010402

@begeekmyfriend @keithito Actually it is not so complicated. Just like this:

```python
from models.attention import LocationSensitiveAttention

attention_mechanism = LocationSensitiveAttention(
    hp.attention_dim,
    encoder_outputs,
    hparams=hp,
    mask_encoder=hp.mask_encoder,
    memory_sequence_length=input_lengths,
    smoothing=hp.smoothing,
    cumulate_weights=hp.cumulative_weights)
```

Then, in the AttentionWrapper call, replace the original `BahdanauAttention(hp.attention_depth, encoder_outputs)` with `attention_mechanism`.

This attention mechanism can solve the problem of skipping and repeating.

hyzhan avatar Sep 14 '18 09:09 hyzhan

@begeekmyfriend Is begeekmyfriend's model changed from the original? keithito's model has 3 GRUs in the decoder, while begeekmyfriend's model has 2 GRUs in the decoder.

Is that right?


hccho2 avatar Oct 22 '18 03:10 hccho2

Thanks @begeekmyfriend. I've tested your repo with the LJSpeech dataset, training on a GPU with all hparams left at their defaults. Yes, I did see that alignment happens more quickly, but the sound quality does not seem as clear as the mimic2 fork's. I've reached 71,000 steps so far, and while the voice is clear, there is a lot of static in the background. Sample output: https://drive.google.com/file/d/1khDaXrfRhFh15JRNT87NI1Hv8FcBuFXC/view?usp=sharing

Here is the same phrase, on the mimic2 repo at 68,000 steps: https://drive.google.com/open?id=1W87m942sfuEh4UCD6n6tQXr6JSAytXRt

Any idea why there is so much static? Will more training solve this, or should I adjust some of the hparams?

EDIT: I just realized I was using the mandarin branch. I'll switch to master and try again.

Alignment using your repo after 71,000 steps step-71000-align

sjtilney avatar Dec 06 '18 18:12 sjtilney

@sjtilney In fact I have added some weird fixes in the mandarin branch that have not been merged into the master branch yet. I will do that soon. If you want to listen to the latest evaluation from the G&L synthesizer, here is a Chinese Mandarin sample: https://github.com/Rayhane-mamah/Tacotron-2/issues/292#issuecomment-444823633

begeekmyfriend avatar Dec 07 '18 01:12 begeekmyfriend

@begeekmyfriend If I'm training on the LJSpeech dataset, should I use the latest commit from the master branch?

sjtilney avatar Dec 07 '18 05:12 sjtilney

I have been training on Chinese Mandarin; you might merge the latest version of that branch into the master one.

begeekmyfriend avatar Dec 07 '18 05:12 begeekmyfriend

@begeekmyfriend @peter05010402 I am confused by the hparams settings; please help me review them. My training data is thchs30, with a 16 kHz sample rate.

In https://github.com/begeekmyfriend/tacotron/tree/mandarin, hparams.py shows: `num_mels=80, num_freq=2049, sample_rate=48000`,

but I used to use: `num_mels=80, num_freq=1025, sample_rate=16000`.

Thanks.

bjtommychen avatar Feb 01 '19 15:02 bjtommychen

@bjtommychen http://www.data-baker.com/open_source.html
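Note also that num_freq is tied to the STFT size rather than being a free choice. If I remember keithito's audio.py correctly, it derives the FFT length as n_fft = (num_freq - 1) * 2, so the two settings you quoted correspond to different FFT sizes:

```python
# num_freq is the number of STFT bins, i.e. n_fft // 2 + 1,
# so the code recovers the FFT length as:
def n_fft_for(num_freq):
    return (num_freq - 1) * 2

print(n_fft_for(2049))  # 4096 -- mandarin branch, 48 kHz Biaobei audio
print(n_fft_for(1025))  # 2048 -- the 16 kHz setting you had been using
```

In other words, `num_freq=1025` is consistent with your 16 kHz data, while the mandarin branch's `num_freq=2049` matches the 48 kHz Biaobei corpus linked above.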

begeekmyfriend avatar Feb 02 '19 01:02 begeekmyfriend

@begeekmyfriend Thanks for your quick response. I got Biaobei.

I have a few more questions:

  1. With the latest code, after training 7K steps on Biaobei the loss is 1.11, and after 30K steps it is 0.83. Is that correct? What target loss can it reach?
  2. Is it OK to add punctuation marks in the labels, like this: 'ni2 hao3 , you3 piao4 ma ?'
  3. Biaobei is 48 kHz. Is it possible to make training faster by converting the 48 kHz audio to 16 kHz?

Thanks.

bjtommychen avatar Feb 02 '19 03:02 bjtommychen

  1. Loss does not matter; the final evaluation does.
  2. You need to add those marks yourself, sentence by sentence.
  3. Yes, a lower sample rate trains faster; see the resampling sketch below.
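For point 3, here is a minimal resampling sketch using librosa (the file names are placeholders, and it assumes librosa and soundfile are installed):

```python
import librosa
import soundfile as sf

# Load a 48 kHz Biaobei clip and write a 16 kHz copy for faster training.
y, sr = librosa.load('000001.wav', sr=48000)
y16 = librosa.resample(y, orig_sr=48000, target_sr=16000)
sf.write('000001_16k.wav', y16, 16000)
```

Remember to set `sample_rate=16000` and the matching `num_freq` in hparams.py afterwards.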

begeekmyfriend avatar Feb 02 '19 05:02 begeekmyfriend

Hello, I tried using your mandarin branch to synthesize Chinese speech. I used Tsinghua's data_thchs30 dataset and did not change any parameters, but the synthesized result looks like this: step-104000-align Where do you think the problem lies? Thank you.

MonkeyBLuffy avatar Mar 09 '20 07:03 MonkeyBLuffy

Please use the Biaobei (标贝) open corpus.

begeekmyfriend avatar Mar 09 '20 07:03 begeekmyfriend

Hi, I have a problem with the alignment curve: at step 22,000 there is no alignment at all. I am using a modified Arabic version of Keith's model on Egyptian Arabic. My dataset is 1,145 sentences at a 48 kHz sample rate, recorded under the supervision of a professional studio. These are the modified hyperparameters I am using:

```
adam_beta1: 0.9
adam_beta2: 0.999
attention_depth: 256
batch_size: 8
cleaners: arabic_cleaners
decay_learning_rate: True
decoder_depth: 256
embed_depth: 256
encoder_depth: 256
frame_length_ms: 50
frame_shift_ms: 12.5
griffin_lim_iters: 60
initial_learning_rate: 0.001
max_frame_num: 1000
max_iters: 400
min_level_db: -100
num_freq: 2049
num_mels: 80
outputs_per_step: 1
postnet_depth: 256
power: 1.5
preemphasis: 0.97
prenet_depths: [256, 128]
ref_level_db: 20
sample_rate: 48000
```

Loaded metadata for 1145 examples (0.90 hours)

The dataset is small, but I trained a Modern Standard Arabic model on nearly the same amount of data with Keith's hyperparameters, and it produces good sound.

step-22000-align

Could somebody help with this issue? Thanks in advance.

hishammadcor avatar Nov 12 '21 00:11 hishammadcor