Emotions / expressions?
Hi, guys.
I'm collecting speech samples to create a dataset for training a new pt_BR model. My question is: does Piper TTS support emotions/expressions in generated speech?
There are two ways of doing this that I know about, but only one that I've tried.
The first way is training a "multi-speaker" model where each "speaker" is an emotion. I did this with a cool dataset provided by @thorstenMueller for Mimic 3 and I'm training a new voice like it for Piper now. The downside of this approach, of course, is that each sentence can only have one emotion.
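For illustration, the training metadata for this approach could look something like this (a made-up sketch following Piper's LJSpeech-style id|speaker|text layout):

0001|angry|It's locked for a reason.
0002|happy|It's locked for a reason.
0003|neutral|Good bye.

At synthesis time you then pick the emotion with the speaker id, e.g. piper --model voice.onnx --speaker 1.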
The second way is to create new "phonemes" that represent the emotion. In Piper, this could be any UTF-8 codepoint that you add to the voice's phoneme_id_map. You'd need a dataset with emotion markings, and somehow translate those into phonemes (maybe a begin/end for each emotion?). I haven't tried this yet, since I don't know of any dataset that has emotions tagged in such a way.
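As a sketch of that second way, adding marker codepoints to a voice's config could look something like this (untested; phoneme_id_map is a real field in Piper voice configs, but the private-use codepoints and the id allocation below are made up for illustration):

import json

# Untested sketch: add emotion marker "phonemes" to a Piper voice config.
with open("voice.onnx.json", encoding="utf-8") as f:
    config = json.load(f)

id_map = config["phoneme_id_map"]
next_id = max(ids[0] for ids in id_map.values()) + 1

# One begin/end marker pair per emotion (here: just one emotion).
for marker in ("\ue000", "\ue001"):  # begin-angry / end-angry
    if marker not in id_map:
        id_map[marker] = [next_id]
        next_id += 1

with open("voice.onnx.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)

Keep in mind the model only learns what these markers mean if it's trained with them in the data.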
But can't you use SSML for emotion, so that each word can have its own emotion, like in your OpenTTS project?
<speak>
  <voice name="glow-speak:en-us_mary_ann_angry">
    <s>
      Kwaheri
    </s>
  </voice>
  <voice name="glow-speak:en-us_mary_ann_happy">
    <s>
      Kwaheri
    </s>
  </voice>
  <voice name="glow-speak:en-us_mary_ann_happy">
    <s>
      Good bye
    </s>
  </voice>
</speak>
I don't have a dataset where the audio from Mary Ann is split out by emotion.
What I'm saying is that, with the right data, you can do it.
Oh, OK. So basically, I have to record it like a multi-speaker process... But how will Piper "know" which speaker to play?
Hello @synesthesiam!
I have a huge dataset of 243,700 voice samples with emotion markers like these for 8 different emotions:
"😲"
"😠"
"😕"
"😐"
"😊"
"😒"
"😨"
"😢"
I'm trying to use emojis as markers in my dataset, but when training/converting, Piper translates the emojis into phonemes and literally says out loud "angry face It's locked for a reason. angry face":
😠 It's locked for a reason. 😠
[2024-06-02 00:57:24.706] [piper] [debug] Phonemizing text: 😠 It's locked for a reason. 😠
[2024-06-02 00:57:24.710] [piper] [debug] Converting 37 phoneme(s) to ids: ˈæŋɡɹi fˈeɪs ɪts lˈɑːkt fɚɹɚ ɹˈiːzən.
[2024-06-02 00:57:24.711] [piper] [debug] Converted 37 phoneme(s) to 77 phoneme id(s): 1, 0, 120, 0, 39, 0, 44, 0, 66, 0, 88, 0, 21, 0, 3, 0, 19, 0, 120, 0, 18, 0, 74, 0, 31, 0, 3, 0, 74, 0, 32, 0, 31, 0, 3, 0, 24, 0, 120, 0, 51, 0, 122, 0, 23, 0, 32, 0, 3, 0, 19, 0, 60, 0, 88, 0, 60, 0, 3, 0, 88, 0, 120, 0, 21, 0, 122, 0, 38, 0, 59, 0, 26, 0, 10, 0, 2,
[2024-06-02 00:57:24.711] [piper] [debug] Synthesizing audio for 77 phoneme id(s)
[2024-06-02 00:57:24.876] [piper] [debug] Synthesized 2.449705215419501 second(s) of audio in 0.165027265 second(s)
[2024-06-02 00:57:24.877] [piper] [debug] Converting 12 phoneme(s) to ids: ˈæŋɡɹi fˈeɪs
[2024-06-02 00:57:24.878] [piper] [debug] Converted 12 phoneme(s) to 27 phoneme id(s): 1, 0, 120, 0, 39, 0, 44, 0, 66, 0, 88, 0, 21, 0, 3, 0, 19, 0, 120, 0, 18, 0, 74, 0, 31, 0, 2,
[2024-06-02 00:57:24.878] [piper] [debug] Synthesizing audio for 27 phoneme id(s)
Should I modify how the phonemization happens inside piper-phonemize itself? Do I need to make this type of UTF-8 character produce silence, and how would I achieve that? Or does it belong in the training process, by adding these phonemes to config.json? But even then, they won't be used as-is, because the text is phonemized first and the emojis come out as phonemes like "æŋɡɹi fˈeɪs".
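One workaround sketch I'm considering (untested; the mapping and marker codepoints below are made up): swap each emoji for a Unicode private-use codepoint before the text ever reaches the phonemizer, so espeak never sees anything it can read out loud:

# Untested sketch: replace emoji with private-use marker codepoints
# before phonemization, following the begin/end marker idea above.
EMOJI_TO_MARKER = {
    "😠": ("\ue000", "\ue001"),  # begin-angry / end-angry
    "😊": ("\ue002", "\ue003"),  # begin-happy / end-happy
    # ... one pair per emotion
}

def tag_utterance(text: str, emoji: str) -> str:
    begin, end = EMOJI_TO_MARKER[emoji]
    return begin + text.strip() + end

print(tag_utterance("It's locked for a reason.", "😠"))

Even then, those markers still need entries in phoneme_id_map, and piper-phonemize may simply drop codepoints it doesn't know, so a small patch there might still be required.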
@Haurrus Could you please share the dataset? Maybe there are ways to build a Piper model with an emotion prompt.
I was thinking the same about emotions and or expressions. I don't know how it will be handled in Piper.
An easy way, which doesn't require any meta tags or SSML, is to use different datasets and train a separate voice per emotion.
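A minimal sketch of that per-voice approach, assuming one trained .onnx model per emotion (the model file names here are hypothetical; the piper CLI reads text from stdin):

import subprocess

# Untested sketch: one trained voice per emotion, picked at synthesis time.
MODELS = {
    "angry": "pt_BR-voice-angry.onnx",
    "happy": "pt_BR-voice-happy.onnx",
}

def say(text: str, emotion: str, out_wav: str) -> None:
    subprocess.run(
        ["piper", "--model", MODELS[emotion], "--output_file", out_wav],
        input=text.encode("utf-8"),
        check=True,
    )

say("Está trancado por um motivo.", "angry", "angry.wav")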
I'm closing this issue, since Piper currently won't add emotions/expressions. Moving to F5-TTS.