DeepLearningExamples

[FastPitch 1.1] Roadmap and Augmentations?

Open ArEnSc opened this issue 2 years ago • 3 comments

@alancucki I am a huge fan of the work here; it's honestly the most stable and fastest TTS system I have used.

Is there a way to provide emotive synthesis, where we can capture sadness, happiness, and other emotions? I believe this can be done with global style tokens (GST), similar to Mellotron and GST-Tacotron 2. If so, what would be the challenges of doing so?

ArEnSc avatar Feb 21 '22 18:02 ArEnSc

Thanks!

IMHO, to do proper emotive synthesis, there has to be an emotive dataset. In FastPitch, the speaker conditioning that is implemented for multi-speaker models could be overloaded to handle different variants (like sub-speakers). The problem then shifts to data labeling or unsupervised training.
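To make the sub-speaker idea concrete, here is a minimal sketch in plain Python. It is not FastPitch code: the helper names (`build_variant_ids`, `init_embedding_table`, `condition`), the speaker and style labels, and the choice to add the embedding to each token vector are all assumptions made for illustration. The point is only that each (speaker, style) pair gets its own ID, and that ID selects a row in what would otherwise be the speaker embedding table.

```python
import random
from itertools import product


def build_variant_ids(speakers, styles):
    """Give every (speaker, style) pair its own integer ID (a "sub-speaker")."""
    return {pair: idx for idx, pair in enumerate(product(speakers, styles))}


def init_embedding_table(n_rows, dim, seed=0):
    """Stand-in for a learned embedding table: one random vector per ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_rows)]


def condition(token_features, table, variant_id):
    """Add the selected variant embedding to every token feature vector."""
    emb = table[variant_id]
    return [[x + e for x, e in zip(tok, emb)] for tok in token_features]


# Hypothetical labels; a real emotive dataset would have to supply these.
speakers = ["spk0", "spk1"]
styles = ["neutral", "happy", "sad"]

variant_ids = build_variant_ids(speakers, styles)
table = init_embedding_table(len(variant_ids), dim=4)

tokens = [[0.0] * 4 for _ in range(3)]  # three dummy token vectors
conditioned = condition(tokens, table, variant_ids[("spk1", "sad")])
```

With two speakers and three styles the table has six rows, so the model would be configured as if it had six speakers. The hard part stays where the comment puts it: every training utterance needs a style label, or the labels have to be discovered without supervision.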

alancucki avatar Mar 11 '22 12:03 alancucki

I just realized this the other day, thanks!

ArEnSc avatar Mar 11 '22 14:03 ArEnSc

@alancucki One more question: is there a way to blend two speakers in the multi-speaker setup with FastPitch 1.1? Thanks!

ArEnSc avatar Mar 14 '22 19:03 ArEnSc