IMS-Toucan: Why does utterance cloner need to know the reference transcription text?

Open Ca-ressemble-a-du-fake opened this issue 1 year ago • 3 comments

Hi,

I tried run_utterance_cloner and got very bad results when the transcription text did not match the reference audio.

Another project I tried (Coqui) also does voice conversion, but it doesn't need the transcription text. So, just out of curiosity, I am wondering why it is needed here :wink:

Thank you in advance for your reply!

Mar 07 '23 15:03 Ca-ressemble-a-du-fake

The utterance cloner tries to clone the prosody, not only the voice. To clone only the voice, you just set the speaker embedding; there is no need for the utterance cloner. To clone the prosody of a target utterance, the system needs a reference for each prosodic value of each phoneme, i.e. the pitch, energy and duration of every phone. The phonemes come from the transcription, so a text that does not match the audio gives misaligned reference values. The application for this is mostly voice masking/voice privacy. Maybe I should move this script somewhere else or rename it; multiple people seem to have confused the purpose of this script already.
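
To illustrate the idea, here is a minimal sketch of per-phone prosody extraction. It is not IMS-Toucan's actual code: `PhoneProsody`, `extract_phone_prosody`, and the input layout (a phone list from the transcription, per-phone durations in frames from an aligner, and frame-level pitch and energy) are all hypothetical names and assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PhoneProsody:
    """Hypothetical container: one prosody reference per phone."""
    phone: str
    duration_frames: int
    mean_pitch: float
    mean_energy: float


def extract_phone_prosody(
    phones: Sequence[str],          # phones derived from the transcription text
    durations: Sequence[int],       # frames per phone, e.g. from a forced aligner
    frame_pitch: Sequence[float],   # one pitch value per audio frame
    frame_energy: Sequence[float],  # one energy value per audio frame
) -> List[PhoneProsody]:
    """Average frame-level pitch and energy over each phone's span."""
    if len(phones) != len(durations):
        raise ValueError("need exactly one duration per phone")
    if len(frame_pitch) != len(frame_energy):
        raise ValueError("pitch and energy must cover the same frames")
    if sum(durations) != len(frame_pitch):
        raise ValueError("durations must add up to the number of frames")

    result = []
    start = 0
    for phone, dur in zip(phones, durations):
        end = start + dur
        if dur > 0:
            pitch = sum(frame_pitch[start:end]) / dur
            energy = sum(frame_energy[start:end]) / dur
        else:
            # A zero-length phone has no frames to average over.
            pitch = energy = 0.0
        result.append(PhoneProsody(phone, dur, pitch, energy))
        start = end
    return result
```

Every value in the result is tied to one phone of the text. In this sketch a phone list of the wrong length is rejected outright; a real aligner would more likely force the wrong phones onto the audio and return durations, pitch and energy that belong to different sounds, which would explain the bad results with a mismatched transcription.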

Mar 08 '23 15:03 Flux9665

Ah, that makes sense for voice privacy purposes. But what would be the application of the biblical ensemble? Is it just for fun?

Mar 08 '23 20:03 Ca-ressemble-a-du-fake

Yes, the ensemble reading is just for fun. We thought it might be useful for fooling a speaker verification system, but had no success.

Apr 13 '23 11:04 Flux9665
