FastSpeech2
Is there any research on the appropriate dataset size for FastSpeech2?
I'm training FastSpeech2 on a multilingual TTS dataset, described below.
- Number of utterances: English (~44,000) + Chinese (~80,000) + Spanish (~30,000) + Japanese (~7,000) + Korean (~130,000), about 300,000 in total
- Number of speakers: about 600
- Number of phoneme tokens: 250
- Number of stress tokens: 28 (stress is kept separate from the phoneme tokens; see the sketch below)
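For clarity, here is a minimal sketch (my own illustration, not from the post) of what "stress separated from the phoneme tokens" can look like in practice: two embedding tables whose outputs are summed into the shared hidden dimension. The class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Separate phoneme and stress vocabularies feeding one encoder."""
    def __init__(self, n_phonemes=250, n_stress=28, hidden_dim=256):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, hidden_dim)
        self.stress_emb = nn.Embedding(n_stress, hidden_dim)

    def forward(self, phoneme_ids, stress_ids):
        # Both sequences have the same length, so summing their embeddings
        # still gives the encoder a single (batch, time, hidden_dim) tensor.
        return self.phoneme_emb(phoneme_ids) + self.stress_emb(stress_ids)

# Example: a batch of 2 utterances, 5 tokens each.
emb = TokenEmbedding()
phonemes = torch.randint(0, 250, (2, 5))
stress = torch.randint(0, 28, (2, 5))
print(emb(phonemes, stress).shape)  # torch.Size([2, 5, 256])
```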
This seems like too much data for a FastSpeech2 model with a hidden dim of 256. The model's output is slightly unstable, and the prosody of the synthesized speech sounds too formal.

### Sample
https://user-images.githubusercontent.com/44384060/154802366-3e1a959f-8652-4adb-95f8-f234ceb09d87.mp4
https://user-images.githubusercontent.com/44384060/154802368-2743543d-1c8b-4be3-aaaf-46102474a788.mp4
So I'm planning to increase the hidden dim to 384 or 512 (though at that point it may become "SlowSpeech"). Has anyone studied the appropriate dataset size for FastSpeech2? Can I expect a quality improvement from scaling the model this way?
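For anyone trying the same thing, here is a rough sketch of the dimensions that usually have to change together when raising the hidden size. The key names and ratios are illustrative assumptions, not taken from any specific FastSpeech2 config file.

```python
def fastspeech2_dims(hidden_dim=256):
    """Hypothetical FastSpeech2 size settings derived from one hidden dim."""
    return {
        "encoder_hidden": hidden_dim,
        "decoder_hidden": hidden_dim,
        # Attention heads must divide the hidden size evenly.
        "encoder_heads": 2,
        "decoder_heads": 2,
        # FFT-block conv filter size is commonly a multiple of the hidden dim.
        "conv_filter_size": hidden_dim * 4,
        "variance_predictor_filter_size": hidden_dim,
    }

for d in (256, 384, 512):
    cfg = fastspeech2_dims(d)
    assert cfg["encoder_hidden"] % cfg["encoder_heads"] == 0
    print(d, cfg)
```

The main point is consistency: the encoder, decoder, and variance-adaptor widths should be scaled together, and the head count must still divide the new hidden size.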
+1, I'm also interested in this.
There is a paper accepted to ACL 2022 about data requirements. It goes in the opposite direction, examining how little data can be used for low-resource languages, but it might still be helpful for other people who find this issue on GitHub.