good results
https://github.com/ggsonic/tacotron/blob/master/10.mp3 Based on your code, I can get clear voices, like the one above. The text is: "The seven angels who had the seven trumpets prepared themselves to sound." You can hear some of the words clearly. The main change is about 'batch_norm': I use instance normalization instead. I think there are problems in the batch norms, and there may also be something wrong with the hp.r-related data flow, but I don't have time to figure that out for now. Later this week I will commit my code. Thanks for your great work!
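For reference, a minimal sketch of what instance normalization on a 3-D tensor could look like in TF 1.x. This is not the committed change; the function name and details are assumptions, it just illustrates the idea of normalizing each sample over its own time axis:

```python
import tensorflow as tf  # TF 1.x style

def instance_norm(inputs, epsilon=1e-8, scope="instance_norm"):
    # Hypothetical sketch: for inputs of shape [N, T, C], normalize each
    # sample over its own time axis, per channel, independently of the batch.
    with tf.variable_scope(scope):
        mean, variance = tf.nn.moments(inputs, axes=[1], keep_dims=True)
        normalized = (inputs - mean) / tf.sqrt(variance + epsilon)
        channels = inputs.get_shape().as_list()[-1]
        gamma = tf.get_variable("gamma", [channels], initializer=tf.ones_initializer())
        beta = tf.get_variable("beta", [channels], initializer=tf.zeros_initializer())
        return gamma * normalized + beta
```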
@ggsonic Nice work! If you could share your training time and loss curve as well as your modified code, it would be appreciated.
Plus, instance normalization instead of batch normalization... interesting. Is anyone willing to review my normalize code in modules.py? If you look at my batch normalization code in modules.py, basically I use tf.contrib.layers.batch_norm. Many people complain that the performance of the batch normalization code in TF is poor, so they officially recommend the fused version for that reason. But fused batch normalization doesn't work for 3-D tensors, so I reshape a 3-D input tensor to 4-D before applying the fused batch normalization and then recover its shape.
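To make the shape gymnastics concrete, here is a minimal sketch of the reshape-then-fused-batch-norm idea described above. The helper name is assumed and this is not the exact code in modules.py:

```python
import tensorflow as tf  # TF 1.x style

def fused_bn_3d(inputs, is_training=True, scope="fused_bn"):
    # Sketch only: fused batch norm needs a 4-D tensor, so expand [N, T, C]
    # to [N, T, 1, C], normalize, then squeeze back to 3-D.
    with tf.variable_scope(scope):
        expanded = tf.expand_dims(inputs, axis=2)            # [N, T, 1, C]
        normalized = tf.contrib.layers.batch_norm(expanded,
                                                  is_training=is_training,
                                                  fused=True,
                                                  updates_collections=None)
        return tf.squeeze(normalized, axis=2)                # [N, T, C]
```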
This is my TensorBoard graph. I used your provided dataset and your default hyperparameters: batch size 32, learning rate 0.001, 200 epochs, trained for 8 hours on a single GPU card to get the above result. After 70 epochs you can hear some words, but it seems that after 110 epochs the learning rate should be lowered; I will test that this weekend.
@ggsonic Nice work! Looking forward to your code modification.
Committed! Actually it is a simple idea and a simple change.
@ggsonic Can you share the 10.mp3 somewhere else (e.g. Dropbox)? I wasn't able to listen to your file; it was never downloaded.
Btw, your loss is still high; a loss with numbers around ~0.001 is more likely to yield good results.
@basuam Just click Download and then "save as" (right click, or Ctrl-S). Also, the loss doesn't have to be low to yield good results; the two are not necessarily directly related to perceptual quality. See e.g. this paper.
@Spotlight0xff I did right-click and save it, but the default player (Windows Media Player) and even VLC cannot play it. The problem with right-click (in my case) is that it does not save the file itself, it saves the metadata related to the file, which is why I cannot play it. I'm using MATLAB to read the file and it kind of worked; I just need the sampling frequency to listen properly to whatever you have listened to. Without it, playback is either too slow or too fast and it sounds like either a demon or a chipmunk talking.
Btw, the point about small loss comes from experience working with audio. I glanced at the paper you attached, but it applies to images, not to sequential data. We cannot extrapolate that information unless someone has tested it for generative models on sequential data.
That is terrific output. Your output sounds similar to the audio samples at https://google.github.io/tacotron/ without the post-processing network and with the vanilla seq2seq. The buzz is present, but a voice is clearly audible. Great job! I look forward to replicating your output.
@ggsonic I've trained your latest commit to 200 epochs (seems to be the default?). Here is the trained model: https://ufile.io/yuu7e And here are the samples generated by default when running eval.py on the latest epoch: https://ufile.io/41djx
Guys, I've run training for around 32k global steps using @ggsonic's latest commit. I used a German dataset (Pavoque, in case someone is interested) and I got some really cool results:
https://soundcloud.com/user-604181615/tacotron-german-pavoque-voice (The words are: "Fühlst du dich etwas unsicher" - "Do you feel a bit unsure").
I did the training on my GTX 1070 for around 12 hours.
I adjusted the maximum sentence length in characters to my dataset, as well as the maximum wave length, and I used zero masking for the loss calculation (sketched below).
I also observed that @ggsonic's instance normalization without the latest merged PRs gives better results.
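For context, zero masking here means that padded (all-zero) target frames are excluded from the loss average. A minimal sketch of that idea, assuming an L1 loss on rank-3 [N, T, C] targets; this is not the exact code used above:

```python
import tensorflow as tf  # TF 1.x style

def masked_l1_loss(predictions, targets):
    # Sketch only: frames whose target vector is all zeros are treated as
    # padding and contribute neither to the numerator nor the denominator.
    is_real_frame = tf.to_float(
        tf.reduce_any(tf.not_equal(targets, 0.0), axis=-1, keep_dims=True))  # [N, T, 1]
    absolute_error = tf.abs(predictions - targets) * is_real_frame
    num_values = tf.reduce_sum(is_real_frame) * tf.to_float(tf.shape(targets)[2])
    return tf.reduce_sum(absolute_error) / (num_values + 1e-8)
```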
@chief7 Did you do any dynamic step size adjustment? While I don't speak German, I think it sounds really good. Perhaps it could be even better if we adjusted the step size (as in the paper)?
No, not yet. I definitely plan to do things like that, but my first goal is (or was) to prove that it's worth spending all that time :D I'll keep you posted on my progress!
@chief7 Can you try it with the dataset mentioned in this GitHub repo? I'm really impressed that "Fühlst du dich etwas unsicher" is so clear; a little robotic, but still, so clear. I'm pretty sure it can be improved; it looks like you have found a way.
I would like to know whether the dataset we are using is too small compared to the dataset you used. Hopefully you can try the Bible dataset. Thank you for your time.
@chief7 could you please tell us the size of your corpus?
Sorry guys, I totally forgot to answer ... weekend.
I use a really small corpus, around 5.3 hours of utterances. It seems to be enough to generate random new sentences.
So I guess the Bible corpus isn't too small. I'll try to check as soon as possible, but my computing resources are limited.
According to @barronalex's code, the frames should be reshaped so that multiple non-overlapping frames are output at each time step; then, in eval.py, these frames are reshaped back to the normal overlapping representation.
An example data flow is shown below. This seems to be the correct hp.r-related data flow as described in the paper. I look forward to getting better results after doing so.
```
[[ 1  1  1  1  1  1  1]
 [ 2  2  2  2  2  2  2]
 [ 3  3  3  3  3  3  3]
 [ 4  4  4  4  4  4  4]
 [ 5  5  5  5  5  5  5]
 [ 6  6  6  6  6  6  6]
 [ 7  7  7  7  7  7  7]
 [ 8  8  8  8  8  8  8]
 [ 9  9  9  9  9  9  9]
 [10 10 10 10 10 10 10]
 [11 11 11 11 11 11 11]
 [12 12 12 12 12 12 12]
 [13 13 13 13 13 13 13]
 [14 14 14 14 14 14 14]
 [15 15 15 15 15 15 15]
 [16 16 16 16 16 16 16]
 [17 17 17 17 17 17 17]
 [18 18 18 18 18 18 18]
 [19 19 19 19 19 19 19]
 [20 20 20 20 20 20 20]
 [21 21 21 21 21 21 21]
 [22 22 22 22 22 22 22]
 [23 23 23 23 23 23 23]
 [24 24 24 24 24 24 24]
 [25 25 25 25 25 25 25]
 [26 26 26 26 26 26 26]
 [27 27 27 27 27 27 27]
 [28 28 28 28 28 28 28]
 [29 29 29 29 29 29 29]
 [30 30 30 30 30 30 30]
 [31 31 31 31 31 31 31]
 [32 32 32 32 32 32 32]
 [33 33 33 33 33 33 33]
 [34 34 34 34 34 34 34]
 [35 35 35 35 35 35 35]
 [36 36 36 36 36 36 36]]
```

reshaped to

```
[[ 1  1  1  1  1  1  1  5  5  5  5  5  5  5  9  9  9  9  9  9  9 13 13 13 13 13 13 13 17 17 17 17 17 17 17]
 [ 2  2  2  2  2  2  2  6  6  6  6  6  6  6 10 10 10 10 10 10 10 14 14 14 14 14 14 14 18 18 18 18 18 18 18]
 [ 3  3  3  3  3  3  3  7  7  7  7  7  7  7 11 11 11 11 11 11 11 15 15 15 15 15 15 15 19 19 19 19 19 19 19]
 [ 4  4  4  4  4  4  4  8  8  8  8  8  8  8 12 12 12 12 12 12 12 16 16 16 16 16 16 16 20 20 20 20 20 20 20]
 [21 21 21 21 21 21 21 25 25 25 25 25 25 25 29 29 29 29 29 29 29 33 33 33 33 33 33 33  0  0  0  0  0  0  0]
 [22 22 22 22 22 22 22 26 26 26 26 26 26 26 30 30 30 30 30 30 30 34 34 34 34 34 34 34  0  0  0  0  0  0  0]
 [23 23 23 23 23 23 23 27 27 27 27 27 27 27 31 31 31 31 31 31 31 35 35 35 35 35 35 35  0  0  0  0  0  0  0]
 [24 24 24 24 24 24 24 28 28 28 28 28 28 28 32 32 32 32 32 32 32 36 36 36 36 36 36 36  0  0  0  0  0  0  0]]
```
@ggsonic In the latest pull request, I have added this feature. Could you please help check whether it is right?
@candlewill Not exactly. A simple tf.reshape applied to the example data above gives [[ 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5 5] [ 6 6 6 6 6 6 6 ......],
but we need the non-overlapping frames [[ 1 1 1 1 1 1 1 5 5 5 5 5 5 5 9 9 9 9 9 9 9 13 13 13 13 13 13 13 17 17 17 17 17 17 17] [ 2 2 2 2 2 2 2 ......].
This will do the trick, as the paper says:
This is likely because neighboring speech frames are correlated and each character usually corresponds to multiple frames. Emitting one frame at a time forces the model to attend to the same input token for multiple timesteps; emitting multiple frames allows the attention to move forward early in training.
I think the get_spectrograms function in utils.py should handle this.
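To make the two groupings concrete, here is a small NumPy sketch that reproduces the arrays above. The grouping of 4 output rows per 20-frame chunk is inferred from that example rather than taken from any particular commit, so treat it as an illustration only:

```python
import numpy as np

# Toy data matching the example above: 36 frames, 7 dims, frame i filled with i.
frames = np.repeat(np.arange(1, 37)[:, None], 7, axis=1)       # (36, 7)
r = 5            # reduction factor (hp.r)
chunk_rows = 4   # output rows per chunk, inferred from the posted arrays (assumption)

# Pad the frame count up to a multiple of chunk_rows * r, here 40.
pad = (-len(frames)) % (chunk_rows * r)
padded = np.pad(frames, [(0, pad), (0, 0)], mode="constant")    # (40, 7)

# Plain reshape: row 0 = frames 1..5 concatenated (the first array in the reply above).
contiguous = padded.reshape(-1, r * frames.shape[1])

# Interleaved grouping: row 0 = frames 1, 5, 9, 13, 17 (the desired array).
interleaved = (padded.reshape(-1, r, chunk_rows, frames.shape[1])
                     .transpose(0, 2, 1, 3)
                     .reshape(-1, r * frames.shape[1]))
print(interleaved[0])   # [ 1 1 1 1 1 1 1 5 5 ... 17]
```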
@ggsonic Thanks! I guess you're right. I've changed reduce_frames and adjusted the other relevant parts.
@ggsonic Have you tested the new commits regarding reduce_frames and seen whether they work better?
Could you explain more why it's [[ 1 1 1 1 1 1 1 5 5 5 5 5 5 5 9 9 9 9 9 9 9 13 13 13 13 13 13 13 17 17 17 17 17 17 17] [ 2 2 2 2 2 2 2 ...... and not [[ 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5 5] [ 6 6 6 6 6 6 6 ......]?
In my mind, the latter seems more correct. The paper says "neighboring speech frames are correlated", and therefore those neighboring frames are the ones that should be grouped and predicted together so that the attention can move forward faster. Unless [1 1 1 1 1 1 1] and [5 5 5 5 5 5 5] are considered "neighboring frames" while [1 1 1 1 1 1 1] and [2 2 2 2 2 2 2] are not, the first reshaping (the one currently committed) does not make much sense to me.
@reiinakano In my experiments, the new reduce_frames method makes the training process stable, while the former method always showed something like "mode collapse". But the new method might need more global steps to get good results, and I am still waiting.
@ggsonic Okay, I am also running it right now with default hyperparameters. Do you use batch normalization, or should I just stick with instance normalization? Currently at epoch 94 with batch norm and no voice.. :(
Edit: IIRC, mode collapse is a term for GANs. What do you mean by mode collapse in this context?
Edit 2: Here are my loss curves so far at epoch 94. Any comments?
I didn't have much luck with the new features introduced in the last commits. I get the best results with 7ed2f209233c307b968c7080bc36fda3a70f6707 by @ggsonic, and the loss curves are similar to the ones posted by @reiinakano, especially when it comes to the numbers. The sample I uploaded last week was generated from the model while the loss was around 1.2, just in case someone's interested.
@chief7 Thanks for sharing. Perhaps we should rethink the reduce_frames method? What is your opinion on the "neighboring speech frames" discussion?
Update: I am at epoch 115 and still no voice can be heard.
As far as I understand the paper, they predict frames 1..r all at once. If that's correct, then the current state of the reduce_frames method is not correct. Though I'm still digging into this ...
@reiinakano It is possible that after a certain number of epochs your learning rate is too high for the model to converge, hence the spiking up, going down, repeat.
Here are some of the best results I've obtained after training for about 2 days (before the reduce-frames commit, with instance normalization). It seems that reduce frames is actually making convergence take longer. Here are the scripts:
2: Talib Kweli confirmed to All Hip Hop that he will be releasing an album in the next year.
8: The quick brown fox jumps over the lazy dog
11: Basilar membrane and otolaryngology are not auto correlations.
12: would you like to know more about something.
17: Generative adversarial network or variational auto encoder.
19: New Zealand is an island nation in the south western pacific ocean.
31: New Zealand's capital city is wellington while its most populous city is auckland.
https://www.dropbox.com/s/o1yhsaew8h2lsix/examples.zip?dl=0
@minsangkim142 What is your learning rate? I have reverted to the 7ed2f20 commit by @ggsonic and am at epoch 157, using the default learning rate of 0.001. So far it's been better than with reduce_frames (I can hear a very robotic voice), but I'm not really hearing actual words yet. The loss curve also looks much steadier now, gradually going down.
I started with 0.001, then moved it down to 0.0005, 0.0003 and 0.0001 as suggested by the original paper, except that I changed the learning rate every time the model spiked (suggesting that it may have jumped out of a local minimum); in that case I reverted the model to before the spike and then changed the learning rate. I started hearing some clear voices after 30k global steps with a dataset of 4 hours of utterances, which is about 190 epochs.
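For anyone who wants to automate that schedule instead of adjusting it by hand, a minimal sketch using a piecewise-constant learning rate. The step boundaries follow the Tacotron paper's schedule, not the exact values used above:

```python
import tensorflow as tf  # TF 1.x style

global_step = tf.Variable(0, trainable=False, name="global_step")
# 0.001 until 500k steps, then 0.0005, 0.0003, and 0.0001 (paper-style schedule).
boundaries = [500000, 1000000, 2000000]
values = [0.001, 0.0005, 0.0003, 0.0001]
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)
optimizer = tf.train.AdamOptimizer(learning_rate)
```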
Also, I used .npy objects and np.memmap instead of librosa.load, which roughly doubled the global steps per second.
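A rough sketch of that caching idea, with hypothetical file paths and feature settings (not the exact preprocessing used here): decode each wav once, save the features as .npy, and memory-map them during training so librosa.load is never called in the training loop.

```python
import numpy as np
import librosa

# One-time preprocessing: decode the audio and cache the mel features.
wav, sr = librosa.load("data/sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80).T   # (T, 80)
np.save("data/sample_mel.npy", mel.astype(np.float32))

# Training time: memory-map the cached array so only the needed frames are read.
mel_mm = np.load("data/sample_mel.npy", mmap_mode="r")
batch = np.asarray(mel_mm[:200])   # materialize only the slice fed to the model
```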
@minsangkim142 Can you share those clear voices with us? Thank you for your help and knowledge.
@minsangkim142 Did you use more sophisticated save/restore logic, or did you go with the one checked in here? Or did you turn the whole learning process into a supervised process and adjust everything manually?