IMS-Toucan
Read text in the voice of a specific speaker seen during training/finetuning?
Hi,
I merged all my single-speaker datasets into one larger dataset and finetuned the Meta model on it. Now, at inference time, the output sounds like a mixture of all 5 speakers. I need the text to be read by one given speaker.
Looking at the code, the speaker name does not seem to be stored anywhere, so I don't see an easy way to recall it at inference time.
There is a "voice seed" argument in the controllability interface, but it is not related to a speaker index (e.g. 1 is speaker A, 2 is speaker B, ...).
So I tried to select a specific speaker by passing a 'speaker_reference' to read_text, which is straightforward. But the result does not sound like the reference speaker. My aim is to reproduce https://huggingface.co/spaces/Flux9665/SpeechCloning. How can I do that?
Any advice is greatly appreciated 😊
All speaker cloning is done in a 0-shot manner, so the model never learns the identity of a speaker anywhere; it only receives the speaker embedding as a conditioning signal. The voice seed belongs to a new feature we added, where those speaker embeddings can be randomly generated by a separate model.
Your approach of providing a reference audio as the speaker_reference argument is correct. This will only affect the voice and not so much the prosody. The Speech Cloning approach also uses a reference of someone speaking the desired sentence. The prosody is extracted from the reference spoken sentence phoneme by phoneme (so it has to be an exact match) but then re-synthesized with a different voice. It is basically voice conversion.
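To make the "exact match" requirement concrete, here is a toy sketch of the idea, not the actual Toucan code: prosody values come from the reference utterance one phoneme at a time, the voice comes from a separate speaker embedding, and the transfer is refused if the phoneme sequences differ. The class, the function name, and all numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class PhonemeProsody:
    """Prosody measured on one phoneme of the reference utterance (toy units)."""
    phoneme: str
    duration: float  # seconds
    pitch: float     # e.g. mean F0 over the phoneme
    energy: float


def plan_cloned_utterance(
    reference: Sequence[PhonemeProsody],
    target_phonemes: Sequence[str],
    voice_embedding: Sequence[float],
) -> Dict[str, object]:
    """Pair the reference prosody with a different voice.

    The prosody is copied phoneme by phoneme, so the reference must contain
    exactly the phonemes of the sentence to be synthesized.
    """
    reference_phonemes = [p.phoneme for p in reference]
    if reference_phonemes != list(target_phonemes):
        raise ValueError("reference and target phoneme sequences must match exactly")
    # Prosody from the reference speaker, voice from the embedding.
    return {"prosody": list(reference), "voice": tuple(voice_embedding)}
```

If the reference says even a slightly different sentence, there is nothing to align the per-phoneme values to, which is why the sketch raises instead of guessing.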
The model is not very good at mimicking speaking styles, just the voice of a speaker. So providing a reference audio as speaker_reference is the best the model can do for now.
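Since no speaker identity is stored, one way to get something like a speaker index is to keep that mapping yourself: one clean reference clip per training speaker, looked up by name. A minimal sketch, assuming placeholder names and paths; the `synthesize` callable is hypothetical and stands for whatever wrapper you already use to pass a path as speaker_reference.

```python
from typing import Callable, Dict

# Hypothetical table: one clean reference clip per training speaker.
# The names and paths are placeholders, not something the toolkit stores.
SPEAKER_REFERENCES: Dict[str, str] = {
    "speaker_a": "references/speaker_a.wav",
    "speaker_b": "references/speaker_b.wav",
}


def reference_for(speaker: str, table: Dict[str, str] = SPEAKER_REFERENCES) -> str:
    """Return the reference clip registered for a speaker name."""
    try:
        return table[speaker]
    except KeyError:
        known = ", ".join(sorted(table))
        raise KeyError(f"unknown speaker {speaker!r}; known speakers: {known}") from None


def read_as(speaker: str, text: str, synthesize: Callable[[str, str], None]) -> None:
    """Synthesize `text` conditioned on the clip registered for `speaker`.

    `synthesize(text, reference_path)` is a stand-in for your own wrapper
    that forwards the path as the speaker_reference argument.
    """
    synthesize(text, reference_for(speaker))
```

Usage would look like `read_as("speaker_a", "Hello there.", my_wrapper)`, where `my_wrapper` is your own function. The voice still comes from the embedding of that clip, so how close it gets depends on the reference audio, not on the name.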
Thanks for those details. I don't know why, but I am still disappointed by voice conversion (I also tried the FreeVC implementation that is ready to use in Coqui). I will try to reproduce your results from the demo page by reusing the demo wavs. I'll first check that I get the same results as you!