IMS-Toucan
Why does the utterance cloner need to know the reference transcription text?
Hi,
I tried run_utterance_cloner and got very bad results when the transcription text does not match the reference audio.
Another project I tried (Coqui) also does voice conversion, and it doesn't need the transcription text. So, just out of curiosity, I am wondering why it is needed here :wink: !
Thank you in advance for your reply!
The utterance cloner tries to clone the prosody, not only the voice. To clone just the voice, you set the speaker embedding; you don't need the utterance cloner for that. To clone the prosody of a target utterance, the system needs a reference value for each prosodic feature of each phoneme, i.e. pitch, energy and duration of every phone. The phonemes come from the transcription, so if the text does not match the reference audio, the values end up attached to the wrong phones. The application for this is mostly voice masking/voice privacy. Maybe I should move this script somewhere else or rename it; multiple people seem to have confused its purpose already.
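To make the per-phoneme idea concrete, here is a minimal sketch of the pairing step. It is not IMS-Toucan's actual code: `PhoneProsody` and `pair_prosody_with_phones` are hypothetical names, and the sketch assumes the pitch, energy and duration values have already been measured from the reference audio.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhoneProsody:
    phone: str      # phoneme symbol taken from the transcription
    duration: int   # length of the phone in spectrogram frames
    pitch: float    # average pitch over the phone
    energy: float   # average energy over the phone


def pair_prosody_with_phones(phones: List[str],
                             durations: List[int],
                             pitch: List[float],
                             energy: List[float]) -> List[PhoneProsody]:
    """Attach one (duration, pitch, energy) triple to each phoneme.

    Hypothetical helper: the phones come from the text, the values come
    from the reference audio, and the two must line up one to one.
    """
    if not (len(phones) == len(durations) == len(pitch) == len(energy)):
        raise ValueError("need exactly one prosody value per phoneme")
    return [PhoneProsody(p, d, f, e)
            for p, d, f, e in zip(phones, durations, pitch, energy)]
```

If the transcription has a different number of phones than the audio, the pairing fails outright. If the counts happen to match but the text is wrong, the values are silently assigned to the wrong phones, which would fit the bad results described above.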
Ah, that makes sense for voice privacy purposes. But what would be the application of the biblical ensemble? Is it just for fun?
Yes, the ensemble reading is just for fun. We thought it might be useful for fooling a speaker verification system, but had no success with that.