gst-tacotron
Mumbling in synthesis
Hey, thanks for the implementation @syang1993!
I'm using this code to implement another paper and I've run into some issues during synthesis. I get good alignment during training and the interim synthesised samples sound good. During evaluation, however, synthesis is very unpredictable and sometimes fails to produce intelligible speech; it sounds more like mumbling. This happens not only on long utterances but sometimes on short and mid-length texts too. I'm attaching a few alignment plots and audio examples.
I was wondering if you've come across this before and whether you have any tips on where I should look to fix it. I trained the model using multi-head attention. Do you reckon GMM attention would improve things much?
mumbling_samples.zip
This paper attempts to address this issue: https://arxiv.org/abs/1910.10288 (project page: https://google.github.io/tacotron/publications/location_relative_attention/index.html)
There appears to be a PyTorch implementation: https://github.com/bshall/Tacotron
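On the GMM attention question, here is a rough sketch of why it tends to be more robust at inference time. It follows the Graves-style formulation, where each mixture mean can only move forward along the encoder sequence, so the alignment is monotonic by construction. This is not code from this repo or from the linked implementation: the function name `gmm_attention_step` and its arguments are made up for illustration, and in a real model the raw parameters would be predicted from the decoder state rather than passed in as lists.

```python
import math

def gmm_attention_step(prev_means, raw_weights, raw_widths, raw_shifts, num_encoder_steps):
    """One decoder step of Graves-style GMM attention (illustrative sketch).

    raw_* are unconstrained per-component values (hypothetical inputs here;
    a real model predicts them from the decoder state).
    Returns (alignment over encoder positions, updated means).
    """
    # exp() keeps the mixture weights, widths and mean shifts positive.
    weights = [math.exp(w) for w in raw_weights]
    widths = [math.exp(b) for b in raw_widths]
    # Each mean can only move forward, so attention cannot jump backwards.
    means = [m + math.exp(s) for m, s in zip(prev_means, raw_shifts)]
    # Unnormalised mixture of Gaussian bumps over encoder positions.
    alignment = [
        sum(w * math.exp(-b * (m - j) ** 2) for w, b, m in zip(weights, widths, means))
        for j in range(num_encoder_steps)
    ]
    return alignment, means

if __name__ == "__main__":
    means = [0.0]  # one mixture component, starting at the first encoder step
    for step in range(3):
        alignment, means = gmm_attention_step(means, [0.0], [0.0], [0.0], 6)
        peak = max(range(6), key=alignment.__getitem__)
        print(f"decoder step {step}: attention peak at encoder position {peak}")
```

Because the attention position depends only on the accumulated shifts and not on content matching, it cannot skip back or lock onto the wrong part of the text, which is the failure that usually shows up as mumbling. The paper linked above builds on this kind of location-relative mechanism.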