IMS-Toucan

Read text from a specific speaker seen during training/finetuning?

Open Ca-ressemble-a-du-fake opened this issue 1 year ago • 2 comments

Hi,

I merged all my single-speaker datasets into a bigger one and finetuned the Meta model on it. Now, at inference time, the output sounds like a mixture of all 5 speakers. I need a text to be read by one given speaker.

Looking at the code, the speaker name does not seem to be stored anywhere, so I don't see an easy way to recall it at inference time.

There is actually a "voice seed" argument in the controllability interface, but it is not related to a speaker index (e.g. 1 is speaker A, 2 is speaker B, ...).

So I tried to select a specific speaker by providing a 'speaker_reference' to read_text, which is straightforward. But in the end the output does not sound like the reference speaker. My aim is to reproduce https://huggingface.co/spaces/Flux9665/SpeechCloning. How can I do that?

Any advice greatly appreciated😊

Ca-ressemble-a-du-fake avatar Mar 14 '23 04:03 Ca-ressemble-a-du-fake

All speaker cloning is done in a zero-shot manner, so we never learn the identity of a speaker anywhere, just the speaker embedding as a conditioning signal. The voice seed is for a new feature we added, where those speaker embeddings can be randomly generated from a separate model.

Your approach of providing a reference audio as the speaker_reference argument is correct. This will only affect the voice and not so much the prosody. The Speech Cloning approach also uses a reference of someone speaking the desired sentence. The prosody is extracted from the reference spoken sentence phoneme by phoneme (so it has to be an exact match) but then re-synthesized with a different voice. It is basically voice conversion.

The model is not very good at mimicking speaking styles, just the voice of a speaker. So providing a reference audio as speaker_reference is the best the model can do for now.
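Since speaker names are not stored by the model, one way to get a fixed voice per speaker is to keep that mapping yourself and always pass the same reference recording. The following is only a minimal sketch under stated assumptions, not code from the toolkit: the speaker labels, the file paths, and the `synthesize` callable are all hypothetical, and `synthesize` stands in for whatever call accepts the reference audio.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: one clean reference recording per speaker label.
# The labels and paths are placeholders, not something the model knows about.
SPEAKER_REFERENCES: Dict[str, Path] = {
    "speaker_a": Path("references/speaker_a.wav"),
    "speaker_b": Path("references/speaker_b.wav"),
}


def reference_for(speaker: str, registry: Dict[str, Path] = SPEAKER_REFERENCES) -> Path:
    """Return the reference recording registered for a speaker label."""
    try:
        return registry[speaker]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise KeyError(f"Unknown speaker {speaker!r}; known speakers: {known}") from None


def read_as(
    speaker: str,
    text: str,
    synthesize: Callable[[str, Path], object],
    registry: Dict[str, Path] = SPEAKER_REFERENCES,
) -> object:
    """Synthesize `text` using the reference recording of `speaker`.

    `synthesize` is an assumed wrapper that takes the text and the path to
    the reference audio and returns whatever the synthesis call produces.
    """
    return synthesize(text, reference_for(speaker, registry))
```

In practice `synthesize` would be a small wrapper around the call that takes the reference, for example `read_text` with its `speaker_reference` argument as mentioned above. The exact signature is not assumed here.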

Flux9665 avatar Apr 06 '23 11:04 Flux9665

Thanks for those details. I don't know why, but I am still disappointed by voice conversion (I also tried the FreeVC implementation that is ready to use in Coqui). I will try to reproduce your results from the demo page by reusing the demo wavs. I'll first check that I get the same results as you!

Ca-ressemble-a-du-fake avatar Apr 06 '23 18:04 Ca-ressemble-a-du-fake