A great improvement has been made for the master branch (LJSpeech)
Hi all, I have good news: we may no longer need to worry about slow alignment or a huge dataset for this project, because I have found a way to improve it. Several days ago I was still discussing how many wav clips, or how many hours of audio, are needed for training. Now the change is available on my fork here. I used the default LJSpeech
for training and, as we can see, the alignment was learnt quickly, within 7K steps, with less than 20 hours of data (fewer than 10K clips).
The major modification is the location-sensitive attention model inspired by @Rayhane-mamah, at commit https://github.com/begeekmyfriend/tacotron/commit/b02eee7b86019dfefb1ba36a4e193b186307daa6. I have abandoned the original AttentionWrapper
API, and the audio quality remains good with the G&L vocoder. By the way, AttentionWrapper
really sucks; we'd better not use it.
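(For readers who have not seen it before, here is a minimal NumPy sketch of the idea behind location-sensitive attention: each encoder step is scored with an extra term computed by convolving the cumulative alignments, so the decoder "knows" where it has already attended. All names and shapes below are illustrative, not taken from the linked commit.)

```python
import numpy as np

def location_sensitive_energies(query, memory, cum_alignments,
                                W_q, W_m, W_f, v, conv_filters):
    """Illustrative energy computation for location-sensitive attention.

    query:          (attn_dim,)   projected decoder state for this step
    memory:         (T, enc_dim)  encoder outputs
    cum_alignments: (T,)          sum of attention weights from previous steps
    conv_filters:   (K, F)        1-D filters applied over the alignments
    """
    K, F = conv_filters.shape
    pad = K // 2
    padded = np.pad(cum_alignments, (pad, pad))
    # Location features: what the attention has already covered.
    loc_feats = np.stack(
        [padded[t:t + K] @ conv_filters for t in range(memory.shape[0])])  # (T, F)
    # Additive (Bahdanau-style) energies, extended with the location term.
    return np.tanh(query @ W_q + memory @ W_m + loc_feats @ W_f) @ v       # (T,)

# Tiny demo with random weights.
T, enc_dim, attn_dim, hidden, K, F = 6, 4, 3, 5, 3, 2
rng = np.random.default_rng(0)
e = location_sensitive_energies(
    query=rng.normal(size=attn_dim),
    memory=rng.normal(size=(T, enc_dim)),
    cum_alignments=np.zeros(T),
    W_q=rng.normal(size=(attn_dim, hidden)),
    W_m=rng.normal(size=(enc_dim, hidden)),
    W_f=rng.normal(size=(F, hidden)),
    v=rng.normal(size=hidden),
    conv_filters=rng.normal(size=(K, F)))
alignments = np.exp(e) / np.exp(e).sum()  # softmax over encoder steps; accumulate per decoder step
```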
The Chinese Mandarin version is also available on this branch. You are welcome to give it a try.
Hi, thank you for your work. I am very impressed by your fork as well. Could you elaborate a little on how the alignment curve looks in the early training phases for the Mandarin version? I am trying to replicate a good alignment result on the THCHS30 data. Thank you!
The alignment would appear in 10K steps unless your dataset is abnormal.
Oh wow, it did show up!
You can even reduce your data to 1~5 hours; just try it.
Nice work!
@keithito The modification is quite large but effective. There are two major aspects. First, I have replaced the AttentionWrapper
interface. Second, I have added a stop token target during training so the model learns when to stop decoding. If I send a PR, would you like to accept it?
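(A minimal TF 1.x-style sketch of what a stop-token target adds to the training loss; the tensor names and shapes below are illustrative, not necessarily the fork's actual code.)

```python
import tensorflow as tf

# Hypothetical shapes; in practice these come from the decoder and the data feeder.
batch, max_frames = 32, 400
stop_token_logits = tf.placeholder(tf.float32, [batch, max_frames])   # one logit per decoder frame
stop_token_targets = tf.placeholder(tf.float32, [batch, max_frames])  # 0 before the end, 1 at/after it

# Binary cross-entropy on the per-frame "stop" prediction.
stop_token_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=stop_token_targets, logits=stop_token_logits))

# This term is simply added to the existing mel/linear reconstruction losses;
# at inference, decoding halts once sigmoid(logit) crosses a threshold (e.g. 0.5).
```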
@begeekmyfriend Yes, if you send over a PR, I would be happy to review and merge it.
Hello @begeekmyfriend, does the Chinese Mandarin version fit the THCHS30 dataset?
@peter05010402 Of course it does.
@begeekmyfriend Thank you for your reply. Using a single speaker's wavs and transcripts, I trained for 30k steps, but there is no alignment; the data quantity is 2 hours. Could you help?
The plot does not look like it came from my fork. Did you use https://github.com/begeekmyfriend/tacotron/tree/mandarin ?
@begeekmyfriend Thank you for your reply! Do you mean I should clone commit f8de0d7?
Keith Ito's original repo is only for English. I guess what you want to train is Chinese Mandarin, isn't it?
@begeekmyfriend Thank you for your work. It aligned at 7k steps on THCHS30!
But it does not align at 7k steps on one speaker's speech from THCHS30 (the quantity is 2.15 hours).
Could you give some suggestions on how to improve it?
@begeekmyfriend @keithito Actually it is not so complicated. Just like this:

```python
from models.attention import LocationSensitiveAttention

attention_mechanism = LocationSensitiveAttention(
    hp.attention_dim, encoder_outputs, hparams=hp,
    mask_encoder=hp.mask_encoder, memory_sequence_length=input_lengths,
    smoothing=hp.smoothing, cumulate_weights=hp.cumulative_weights)
```

Then replace the original "BahdanauAttention(hp.attention_depth, encoder_outputs)" passed to AttentionWrapper with "attention_mechanism".
This attention can solve the skip and repeat phenomenon.
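(For context, a hypothetical sketch of where that swap lands in the decoder construction; the wrapped decoder_cell and the surrounding wiring are approximations, not keithito's exact code.)

```python
import tensorflow as tf
from models.attention import LocationSensitiveAttention  # from begeekmyfriend's fork

# hp, encoder_outputs and input_lengths come from the surrounding model code,
# as in the snippet above; decoder_cell is a placeholder for whatever RNN cell
# the repo already wraps (e.g. a prenet-wrapped GRUCell).
attention_mechanism = LocationSensitiveAttention(
    hp.attention_dim, encoder_outputs, hparams=hp,
    mask_encoder=hp.mask_encoder, memory_sequence_length=input_lengths,
    smoothing=hp.smoothing, cumulate_weights=hp.cumulative_weights)

attention_cell = tf.contrib.seq2seq.AttentionWrapper(
    decoder_cell,            # previously wrapped around BahdanauAttention(...)
    attention_mechanism,     # the drop-in replacement
    alignment_history=True)  # keep alignments around so they can be plotted
```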
@begeekmyfriend Is begeekmyfriend's model changed from the original model? keithito's model has 3 GRUs in the decoder, while begeekmyfriend's model has 2 GRUs in the decoder.
Is that right?
Thanks @begeekmyfriend. I've tested your repo with the LJSpeech dataset, training on a GPU, and left all hparams at their defaults. Yes, I did see that alignment happens more quickly, but the sound quality does not appear to be as clear compared to the mimic2 fork. I've reached 71,000 steps so far, and while the voice is clear, there is a lot of static in the background. Sample output: https://drive.google.com/file/d/1khDaXrfRhFh15JRNT87NI1Hv8FcBuFXC/view?usp=sharing
Here is the same phrase, on the mimic2 repo at 68,000 steps: https://drive.google.com/open?id=1W87m942sfuEh4UCD6n6tQXr6JSAytXRt
Any idea why there is so much static? Will more training solve this, or should I adjust some of the hparams?
EDIT I just realized I was using the mandarin branch. I'll switch to master, and try again.
Alignment using your repo after 71,000 steps
@sjtilney In fact I have added some weird solutions to the mandarin
branch that have not been merged into the master
branch yet. I will do that soon. If you want to listen to the latest evaluation from the G&L synthesizer, here is a Chinese Mandarin sample: https://github.com/Rayhane-mamah/Tacotron-2/issues/292#issuecomment-444823633
@begeekmyfriend If I'm training on the LJspeech dataset, should I use the latest commit from the Master branch?
I have been training on Chinese Mandarin; you may merge the latest version of that branch into the master one.
@begeekmyfriend @peter05010402 I am confused by the hparams settings; please help me review them. My training data is THCHS30, whose sample rate is 16 kHz.
In https://github.com/begeekmyfriend/tacotron/tree/mandarin the hparams.py shows: num_mels=80, num_freq=2049, sample_rate=48000,
but I have been using: num_mels=80, num_freq=1025, sample_rate=16000.
Thanks.
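(For reference, num_freq generally follows the FFT size rather than the sample rate directly, via num_freq = n_fft // 2 + 1: 2049 corresponds to n_fft = 4096, a common choice for 48 kHz audio, while 1025 corresponds to n_fft = 2048, common for 16 kHz audio. A tiny sanity check, assuming that relationship holds in this fork:)

```python
# num_freq = n_fft // 2 + 1, so it should track the STFT size,
# which in turn is usually scaled with the sample rate.
def num_freq_for(n_fft):
    return n_fft // 2 + 1

print(num_freq_for(4096))  # 2049 -> pairs with sample_rate=48000 in the mandarin branch
print(num_freq_for(2048))  # 1025 -> pairs with sample_rate=16000 (e.g. THCHS30)
```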
@bjtommychen http://www.data-baker.com/open_source.html
@begeekmyfriend Thanks for your quick response. I got Biaobei.
I have a few more questions:
- With the latest code, after training 7k steps on Biaobei the loss is 1.11, and after 30k steps it is 0.83. Is that correct? What target loss can be reached?
- Is it OK to add punctuation marks in the labels, like 'ni2 hao3 , you3 piao4 ma ?'
- Biaobei is 48 kHz. Is it possible to make training faster by converting 48 kHz into 16 kHz?
Thanks.
- Loss does not matter; the final evaluation does.
- You need to add those marks yourself, sentence by sentence.
- Yes, a lower sample rate trains faster.
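(If you do convert Biaobei from 48 kHz to 16 kHz before preprocessing, a minimal resampling sketch is below; it assumes librosa and soundfile are installed, and the paths are placeholders, not from the repo.)

```python
import librosa
import soundfile as sf

# Hypothetical paths; point these at the Biaobei corpus on your machine.
src = 'biaobei/000001.wav'
dst = 'biaobei_16k/000001.wav'

wav, sr = librosa.load(src, sr=16000)  # librosa resamples on load when sr is given
sf.write(dst, wav, 16000)

# Remember to set sample_rate=16000 (and a matching num_freq) in hparams.py afterwards.
```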
Hello, I tried using your mandarin branch to synthesize Chinese speech. I used Tsinghua's data_thchs30 dataset and did not change any parameters, but the synthesized result came out like this:
Where do you think the problem is? Thank you.
Please use Biaobei(标贝) open corpus.
Hi, I have a problem with the alignment curve: at step 22,000 there is no alignment at all. I am using an Arabic-modified version of Keith's model on Egyptian Arabic. My dataset is 1,145 sentences, at a 48k sample rate, recorded under the supervision of a professional studio. I modified the hyperparameters; these are what I am using:
adam_beta1: 0.9, adam_beta2: 0.999, attention_depth: 256, batch_size: 8, cleaners: arabic_cleaners, decay_learning_rate: True, decoder_depth: 256, embed_depth: 256, encoder_depth: 256, frame_length_ms: 50, frame_shift_ms: 12.5, griffin_lim_iters: 60, initial_learning_rate: 0.001, max_frame_num: 1000, max_iters: 400, min_level_db: -100, num_freq: 2049, num_mels: 80, outputs_per_step: 1, postnet_depth: 256, power: 1.5, preemphasis: 0.97, prenet_depths: [256, 128], ref_level_db: 20, sample_rate: 48000
Loaded metadata for 1145 examples (0.90 hours)
The dataset is small, but with nearly the same size I trained a Modern Standard Arabic model using Keith's hyperparameters, and it gives good sound.
Could somebody help with this issue? Thanks in advance.