Emotions / expressions?
Hi, guys.
I'm collecting speech samples to create a dataset for training a new pt_BR model. My question is: does Piper TTS support emotions/expressions in generated speech?
There are two ways of doing this that I know about, but only one that I've tried.
The first way is training a "multi-speaker" model where each "speaker" is an emotion. I did this with a cool dataset provided by @thorstenMueller for Mimic 3 and I'm training a new voice like it for Piper now. The downside of this approach, of course, is that each sentence can only have one emotion.
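For illustration, the training metadata for this approach could look something like this (a made-up sketch following Piper's LJSpeech-style id|speaker|text layout):

0001|angry|It's locked for a reason.
0002|happy|It's locked for a reason.
0003|neutral|Good bye.

At synthesis time you then pick the emotion with the speaker id, e.g. piper --model voice.onnx --speaker 1.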
The second way is to create new "phonemes" that represent the emotion. In Piper, this could be any UTF-8 codepoint that you add to the voice's phoneme_id_map. You'd need a dataset with emotion markings, and somehow translate those into phonemes (maybe a begin/end for each emotion?). I haven't tried this yet, since I don't know of any dataset that has emotions tagged in such a way.
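As a sketch of that second way, adding marker codepoints to a voice's config could look something like this (untested; phoneme_id_map is a real field in Piper voice configs, but the private-use codepoints and the id allocation below are made up for illustration):

import json

# Untested sketch: add emotion marker "phonemes" to a Piper voice config.
with open("voice.onnx.json", encoding="utf-8") as f:
    config = json.load(f)

id_map = config["phoneme_id_map"]
next_id = max(ids[0] for ids in id_map.values()) + 1

# One begin/end marker pair per emotion (here: just one emotion).
for marker in ("\ue000", "\ue001"):  # begin-angry / end-angry
    if marker not in id_map:
        id_map[marker] = [next_id]
        next_id += 1

with open("voice.onnx.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)

Keep in mind the model only learns what these markers mean if it's trained with them in the data.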
But can't you use SSML for emotion, so that each word can have its own emotion, like in your OpenTTS project?
<speak>
  <voice name="glow-speak:en-us_mary_ann_angry">
    <s>
      Kwaheri
    </s>
  </voice>
  <voice name="glow-speak:en-us_mary_ann_happy">
    <s>
      Kwaheri
    </s>
  </voice>
  <voice name="glow-speak:en-us_mary_ann_happy">
    <s>
      Good bye
    </s>
  </voice>
</speak>
I don't have a dataset where the audio from Mary Ann is split out by emotion.
What I'm saying is that, with the right data, you can do it.
Oh, OK. So basically, I have to record it like a multi-speaker process... But how will Piper "know" which speaker to play?
Hello @synesthesiam!
I have a huge dataset of 243,700 voice samples with emotion markers like these for 8 different emotions:
"😲"
"😠"
"😕"
"😐"
"😊"
"😒"
"😨"
"😢"
I'm trying to use emojis as markers in my dataset, but when training/converting, Piper translates the emojis into phonemes and literally says out loud "angry face It's locked for a reason. angry face":
😠 It's locked for a reason. 😠
[2024-06-02 00:57:24.706] [piper] [debug] Phonemizing text: 😠 It's locked for a reason. 😠
[2024-06-02 00:57:24.710] [piper] [debug] Converting 37 phoneme(s) to ids: ˈæŋɡɹi fˈeɪs ɪts lˈɑːkt fɚɹɚ ɹˈiːzən.
[2024-06-02 00:57:24.711] [piper] [debug] Converted 37 phoneme(s) to 77 phoneme id(s): 1, 0, 120, 0, 39, 0, 44, 0, 66, 0, 88, 0, 21, 0, 3, 0, 19, 0, 120, 0, 18, 0, 74, 0, 31, 0, 3, 0, 74, 0, 32, 0, 31, 0, 3, 0, 24, 0, 120, 0, 51, 0, 122, 0, 23, 0, 32, 0, 3, 0, 19, 0, 60, 0, 88, 0, 60, 0, 3, 0, 88, 0, 120, 0, 21, 0, 122, 0, 38, 0, 59, 0, 26, 0, 10, 0, 2,
[2024-06-02 00:57:24.711] [piper] [debug] Synthesizing audio for 77 phoneme id(s)
[2024-06-02 00:57:24.876] [piper] [debug] Synthesized 2.449705215419501 second(s) of audio in 0.165027265 second(s)
[2024-06-02 00:57:24.877] [piper] [debug] Converting 12 phoneme(s) to ids: ˈæŋɡɹi fˈeɪs
[2024-06-02 00:57:24.878] [piper] [debug] Converted 12 phoneme(s) to 27 phoneme id(s): 1, 0, 120, 0, 39, 0, 44, 0, 66, 0, 88, 0, 21, 0, 3, 0, 19, 0, 120, 0, 18, 0, 74, 0, 31, 0, 2,
[2024-06-02 00:57:24.878] [piper] [debug] Synthesizing audio for 27 phoneme id(s)
Should I modify how the phonemization happens inside piper-phonemize itself? Do I need to make this type of UTF-8 character produce silence, and how would I achieve that? Or does it belong in the training process, by adding these phonemes to config.json? But even then, they won't be used as-is, because the text is phonemized first and the emojis come out as phonemes like "æŋɡɹi fˈeɪs".
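One workaround sketch I'm considering (untested; the mapping and marker codepoints below are made up): swap each emoji for a Unicode private-use codepoint before the text ever reaches the phonemizer, so espeak never sees anything it can read out loud:

# Untested sketch: replace emoji with private-use marker codepoints
# before phonemization, following the begin/end marker idea above.
EMOJI_TO_MARKER = {
    "😠": ("\ue000", "\ue001"),  # begin-angry / end-angry
    "😊": ("\ue002", "\ue003"),  # begin-happy / end-happy
    # ... one pair per emotion
}

def tag_utterance(text: str, emoji: str) -> str:
    begin, end = EMOJI_TO_MARKER[emoji]
    return begin + text.strip() + end

print(tag_utterance("It's locked for a reason.", "😠"))

Even then, those markers still need entries in phoneme_id_map, and piper-phonemize may simply drop codepoints it doesn't know, so a small patch there might still be required.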
@Haurrus Could you please share the dataset? Maybe there are ways to build a Piper model with an emotion prompt.
I was thinking the same about emotions and or expressions. I don't know how it will be handled in Piper.
An easy way, which doesn't require any meta tags or SSML, is to use different datasets and train a separate voice per emotion.
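A minimal sketch of that per-voice approach, assuming one trained .onnx model per emotion (the model file names here are hypothetical; the piper CLI reads text from stdin):

import subprocess

# Untested sketch: one trained voice per emotion, picked at synthesis time.
MODELS = {
    "angry": "pt_BR-voice-angry.onnx",
    "happy": "pt_BR-voice-happy.onnx",
}

def say(text: str, emotion: str, out_wav: str) -> None:
    subprocess.run(
        ["piper", "--model", MODELS[emotion], "--output_file", out_wav],
        input=text.encode("utf-8"),
        check=True,
    )

say("Está trancado por um motivo.", "angry", "angry.wav")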
I'm closing this issue, since Piper currently won't add emotions/expressions. Moving to F5-TTS.