FastSpeech2
Is there any research on the appropriate dataset size for FastSpeech2?
I'm training FastSpeech2 on a multilingual TTS dataset, described below.
- Number of utterances: English (~44,000) + Chinese (~80,000) + Spanish (~30,000) + Japanese (~7,000) + Korean (~130,000), about 300,000 in total
- Number of speakers: about 600
- Number of phoneme tokens: 250
- Number of stress tokens: 28 (stress is kept separate from the phoneme tokens; see the sketch below)
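For clarity, here is a minimal sketch (my own illustration, not from the post) of what "stress separated from the phoneme tokens" can look like in practice: two embedding tables whose outputs are summed into the shared hidden dimension. The class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Separate phoneme and stress vocabularies feeding one encoder."""
    def __init__(self, n_phonemes=250, n_stress=28, hidden_dim=256):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, hidden_dim)
        self.stress_emb = nn.Embedding(n_stress, hidden_dim)

    def forward(self, phoneme_ids, stress_ids):
        # Both sequences have the same length, so summing their embeddings
        # still gives the encoder a single (batch, time, hidden_dim) tensor.
        return self.phoneme_emb(phoneme_ids) + self.stress_emb(stress_ids)

# Example: a batch of 2 utterances, 5 tokens each.
emb = TokenEmbedding()
phonemes = torch.randint(0, 250, (2, 5))
stress = torch.randint(0, 28, (2, 5))
print(emb(phonemes, stress).shape)  # torch.Size([2, 5, 256])
```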
This seems like too much data for a FastSpeech2 model with a hidden dim of 256. The model's output is slightly unstable, and the prosody of the synthesized speech sounds too formal.

### Sample
https://user-images.githubusercontent.com/44384060/154802366-3e1a959f-8652-4adb-95f8-f234ceb09d87.mp4
https://user-images.githubusercontent.com/44384060/154802368-2743543d-1c8b-4be3-aaaf-46102474a788.mp4
So I'm planning to increase the hidden dim to 384 or 512 (though at that point it may become "SlowSpeech"). Has anyone studied the appropriate dataset size for FastSpeech2? Can I expect a quality improvement from scaling the model this way?
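For anyone trying the same thing, here is a rough sketch of the dimensions that usually have to change together when raising the hidden size. The key names and ratios are illustrative assumptions, not taken from any specific FastSpeech2 config file.

```python
def fastspeech2_dims(hidden_dim=256):
    """Hypothetical FastSpeech2 size settings derived from one hidden dim."""
    return {
        "encoder_hidden": hidden_dim,
        "decoder_hidden": hidden_dim,
        # Attention heads must divide the hidden size evenly.
        "encoder_heads": 2,
        "decoder_heads": 2,
        # FFT-block conv filter size is commonly a multiple of the hidden dim.
        "conv_filter_size": hidden_dim * 4,
        "variance_predictor_filter_size": hidden_dim,
    }

for d in (256, 384, 512):
    cfg = fastspeech2_dims(d)
    assert cfg["encoder_hidden"] % cfg["encoder_heads"] == 0
    print(d, cfg)
```

The main point is consistency: the encoder, decoder, and variance-adaptor widths should be scaled together, and the head count must still divide the new hidden size.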
+1, I'm also interested in this.
There is a paper accepted to ACL 2022 about data requirements. It goes in the opposite direction, examining how little data can be used for low-resource languages, but it might still be helpful for other people who find this issue on GitHub.