
It would be interesting to let the model make other speech sounds, like laughing

Open Manni1000 opened this issue 1 year ago • 20 comments

Bark also did this, and it is quite helpful.

We could use semantic tags like these for the sounds, similar to Bark:

- [laughter], [laughs], [sighs], [gasps], [clears throat]
- ... for hesitations
- ♪ for song lyrics
- CAPITALIZATION for emphasis of a word
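For illustration, a prompt using tags like these might look like the sketch below (hypothetical; the model would only respect such tags after being trained or finetuned on them):

```python
# Hypothetical prompt with Bark-style tags; the base model won't honour
# these unless it has been trained/finetuned to recognise them.
prompt = (
    "Well, that's one way to do it... [laughs] "
    "I really did NOT expect that to work. [sighs]"
)
```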

Maybe other emotional words would also be interesting, like sad / happy.

But it might be too much work. Do you think it would be possible to add something like this through fine-tuning?

Manni1000 avatar Feb 10 '24 02:02 Manni1000

We will release finetuning code soon. Would love the community to push this work forward :) And we are ofc happy to assist along the way.

sidroopdaska avatar Feb 13 '24 11:02 sidroopdaska

And breathing. That scene from the movie "Her" :heart_eyes:

l4b4r4b4b4 avatar Feb 15 '24 10:02 l4b4r4b4b4

I've added some initial pointers to this here: https://github.com/metavoiceio/metavoice-src/issues/70#issuecomment-1957337895

vatsalaggarwal avatar Feb 21 '24 17:02 vatsalaggarwal

I totally agree with the need to hint at and train towards non-verbal sounds. I think using a semantic tag like [laugh] or [sigh] or [shushing] is better than actual letters such as [hahaha] or [hhh] for a sigh, or even "Shhh" for shushing, because there tends to be bleeding between the concepts in that case. It would really be awesome to be able to infer emotions or reactions this way!

By the way, I've just tested a simple cloning with only the base model, and I must say it is already quite good! I have a challenging speaker, so there's still room for improvement, but I can't wait for finetuning to be out! Very promising, thank you!

maepopi avatar Mar 01 '24 15:03 maepopi

Hmm, that could be one way: have the input include special tokens/words for those, or have a trainable preprocessing model insert them, or simply have the TTS model learn them from the given audio sample. And I actually prefer the latter ;)

l4b4r4b4b4 avatar Mar 01 '24 17:03 l4b4r4b4b4

Hey! I still consider myself a novice in the field, but do you mean that we should be able to caption the audios with the given sounds (like "haha", "shh", "hhh") and then train the model with this? Because that's what I've tried with this repo here (which is really good by the way, and which is based on Tortoise TTS). When you want to finetune a model there, you provide audios and a JSON file with the transcription of your audios, and in the transcription you label the non-verbal sounds to teach the model to recognize them. It actually works very well, but not for all sounds, and I've been struggling with the sigh, for example. I've tried labeling it as "haaa" or "hhhh", but it often gets confused with "shh" or "haha". That is why I was thinking that using [laugh] or [sigh] instead of a literal phonetic transcription might work better.

What do you think?
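To make the comparison concrete, here is a rough sketch of what the two labeling styles could look like in a training manifest (file names and schema are made up for illustration; this is not the actual Tortoise-TTS or metavoice format):

```python
# Hypothetical manifest entries contrasting phonetic vs token labeling of a sigh.
phonetic_style = {
    "audio": "clips/line_042.wav",
    "text": "Hhhh... I suppose you're right.",  # sigh written out phonetically
}

token_style = {
    "audio": "clips/line_042.wav",
    "text": "[sigh] I suppose you're right.",   # sigh marked with a semantic tag
}
```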

maepopi avatar Mar 01 '24 18:03 maepopi

Yeah, this would be great, and we would love to do this! We're focusing on a few more fundamental model improvements which would be hard for the community to manage, and I think folks over at https://github.com/metavoiceio/metavoice-src/issues/70 are close to having the finetuning working... We can try it with that once that is up and running!

It's hard to say how well these things would work without looking at the data first, but I reckon having special tokens for "laughter" / "sigh" / etc might work better than using a prompt like "haha" or "shh"... if someone can share the data they're thinking of training with to get this working, I can comment more :)
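For reference, if the special-token route were tried on top of a Hugging Face-style tokenizer, the usual pattern is roughly the following (a sketch only, with placeholder names; metavoice's actual tokenizer and embedding setup may differ):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Sketch: register non-verbal tags as new special tokens and resize the
# embedding table so they can be learned during finetuning.
# "some/base-model" is a placeholder, not metavoice's actual checkpoint.
tokenizer = AutoTokenizer.from_pretrained("some/base-model")
model = AutoModelForCausalLM.from_pretrained("some/base-model")

new_tags = ["[laughter]", "[sigh]", "[gasps]", "[clears throat]"]
tokenizer.add_special_tokens({"additional_special_tokens": new_tags})
model.resize_token_embeddings(len(tokenizer))
```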

vatsalaggarwal avatar Mar 04 '24 13:03 vatsalaggarwal

I can cook up a little sample with some sentences containing non-verbal sounds, to show how I've been training until now, and write a version of how I think it would be better to train it! I don't know if it'll help, but I can try putting this out this week :)

maepopi avatar Mar 04 '24 13:03 maepopi

@maepopi that would be awesome, and would help for sure!

vatsalaggarwal avatar Mar 04 '24 13:03 vatsalaggarwal

Do you need a specific number of audios / JSON entries, or are a few examples of each non-verbal sound enough?

maepopi avatar Mar 04 '24 14:03 maepopi

To have a look at, a few examples should be enough... for training, we'll probably need more!

vatsalaggarwal avatar Mar 04 '24 14:03 vatsalaggarwal

Okay. My dataset comes from a video game / an audiobook and thus isn't copyright-free, so I don't know if I can share it in full here.

maepopi avatar Mar 04 '24 14:03 maepopi

feel free to email me [email protected] with whatever you can share / if you can share!

vatsalaggarwal avatar Mar 04 '24 14:03 vatsalaggarwal

Ok thanks! I'll see what I can do!

maepopi avatar Mar 04 '24 14:03 maepopi

Hey @vatsalaggarwal, I have sent you a small dataset with two JSONs, one with phonetic transcription and another with token transcription. As I said in my email, I'll sum up my thoughts here so others can jump in.

While transcribing with tokens such as [sigh] and [laugh], I quickly noticed that sometimes it might actually be better to transcribe phonetically. I'm thinking of sentences such as: "Ah, there you are!"

or

"Oh, really?"

Where "Ah" and "Oh" actually act more like words than non verbal sounds.

There are also a lot of cases where you might want some control over the sound you want to generate. For instance, some sighs are longer than others, or convey a different feeling: nostalgia, pain, or boredom, for instance. Likewise for "Hmm", which can convey thinking but also relishing something you're eating. In these cases, maybe it would be a good idea to give more nuance to the token, with options such as [pained sigh] or [nostalgic sigh], but that might end up confusing the model more, especially if it results in having just a few isolated examples in the whole dataset. You'd technically be able to have ten [sigh] tokens, but if you choose to distinguish between them, you might find yourself with ten new individual concepts to train, none of which would have much representation elsewhere in the dataset. On the other hand, gathering all these sounds under the single token [sigh] might more often generate an actual sigh, but you would lose a lot of control over the type of sigh you want.

Maybe a way to fix this would be to also add emotion tokens, such as [sad] or [happy], and combine them with the non-verbal token. That is something the Tortoise-TTS model does: you can write [I am really sad,] at the beginning of your prompt, and the model will try to give a sad intonation to the generated sentence. For emotions or a loud intonation, I also tried labeling words in capital letters in Tortoise-TTS, and it seemed to work rather well, so that's something we could investigate as well. Maybe these could be LoRAs?

In the end, I think it might be best to make the model flexible enough to recognize both types of labeling: phonetic and token. This way, you can first try to train/generate the phonetic way, and if you see there's bleeding or the model doesn't capture the sound well, then you can try with tokens.

Anyway, sorry if my thoughts are a bit messy; I just wanted to share them here as well because obviously all of this is very empirical on my side. Very excited to be part of the conversation though!

maepopi avatar Mar 06 '24 12:03 maepopi

Sorry for the delay here @maepopi ... @lucapericlp should have the finetuning code (on top of @danablend's) out today, and we can take it further once that is done!

vatsalaggarwal avatar Mar 12 '24 11:03 vatsalaggarwal

No problem at all! Keep us posted, can't wait to see where you're going with this :)

maepopi avatar Mar 12 '24 18:03 maepopi

Hey @l4b4r4b4b4 @maepopi @Manni1000, we just released an initial approach for finetuning the last N transformer blocks of the first stage LLM. Just a note that it'd be best to play around with the hyperparams in finetune_params.py as we didn't determine the optimal set (some people from the community were keen to contribute this portion). Let us know if you have any issues or if you're up for contributing any improvements (via param sweep or otherwise)!
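For anyone unfamiliar with the idea, freezing everything except the last N blocks looks roughly like this in generic PyTorch terms (a sketch only; the actual module names and hyperparameters live in the repo's finetuning code and finetune_params.py):

```python
# Sketch of "finetune only the last N transformer blocks" in generic PyTorch.
# `model.blocks` is an assumed attribute name, not metavoice's actual layout.
def freeze_all_but_last_n(model, n: int):
    # freeze everything...
    for p in model.parameters():
        p.requires_grad = False
    # ...then unfreeze only the last n transformer blocks
    for block in model.blocks[-n:]:
        for p in block.parameters():
            p.requires_grad = True

# The optimizer then only updates the unfrozen parameters, e.g.:
# optim = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)
```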

The next step to improve finetuning effectiveness is to have LoRA adapters for the first stage LLM, which is being worked on here.
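For context, a LoRA adapter wraps a frozen linear layer with a trainable low-rank update. A minimal sketch (illustrative only, not necessarily how the adapters in that work are implemented):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update (minimal LoRA sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter weights train
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # base output plus scaled low-rank correction
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```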

lucapericlp avatar Mar 14 '24 13:03 lucapericlp

Thank you so much! I'll try having a look at this this weekend!

maepopi avatar Mar 14 '24 14:03 maepopi

@lucapericlp does this approach support adding new tokens to the vocabulary?

kabachuha avatar Apr 25 '24 15:04 kabachuha