IMS-Toucan

Why does the utterance cloner need to know the reference transcription text?

Open Ca-ressemble-a-du-fake opened this issue 1 year ago • 3 comments

Hi,

I tried the run_utterance_cloner script and got very bad results when the transcription text did not match the reference audio.

Another project I tried (Coqui) also does voice conversion, and it doesn't need the transcription text. So, just out of curiosity, I am wondering why it is needed here :wink:!

Thank you in advance for your reply!

Ca-ressemble-a-du-fake, Mar 07 '23 15:03

The utterance cloner tries to clone the prosody, not only the voice. To clone just the voice, you only need to set the speaker embedding; the utterance cloner is not needed for that. To clone the prosody of a target utterance, the system needs a reference value for each prosodic feature of each phoneme, i.e. the pitch, energy and duration of every phone. The main application for this is voice masking/voice privacy. Maybe I should move this script somewhere else or rename it; multiple people seem to have confused its purpose already.
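To illustrate why per-phone prosody depends on the transcription, here is a minimal sketch. It is not IMS-Toucan's actual code: the names `PhoneProsody` and `per_phone_prosody` are hypothetical, and it assumes that frame-level pitch and energy have already been extracted and that some aligner has produced one duration (in frames) per phone of the transcription. If the transcription does not match the audio, the phone sequence no longer lines up with the frames, and the per-phone values become meaningless or the lengths stop agreeing.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PhoneProsody:
    """Hypothetical container: one prosodic reference per phone."""
    phone: str
    duration: int   # number of audio frames assigned to this phone
    pitch: float    # mean pitch over those frames
    energy: float   # mean energy over those frames


def per_phone_prosody(
    phones: Sequence[str],
    durations: Sequence[int],
    frame_pitch: Sequence[float],
    frame_energy: Sequence[float],
) -> List[PhoneProsody]:
    """Average frame-level pitch and energy over each phone's frames.

    `phones` comes from the transcription; `durations` is assumed to come
    from an aligner that maps each phone to a number of frames.
    """
    if len(phones) != len(durations):
        raise ValueError("need exactly one duration per phone")
    if len(frame_pitch) != len(frame_energy):
        raise ValueError("pitch and energy must cover the same frames")
    if sum(durations) != len(frame_pitch):
        raise ValueError("durations do not cover the audio frames")

    result = []
    start = 0
    for phone, dur in zip(phones, durations):
        end = start + dur
        if dur > 0:
            pitch = sum(frame_pitch[start:end]) / dur
            energy = sum(frame_energy[start:end]) / dur
        else:
            # A phone with no frames gets neutral placeholder values.
            pitch = energy = 0.0
        result.append(PhoneProsody(phone, dur, pitch, energy))
        start = end
    return result


# Toy example: two phones spread over five frames.
reference = per_phone_prosody(
    phones=["h", "i"],
    durations=[2, 3],
    frame_pitch=[100.0, 120.0, 200.0, 210.0, 220.0],
    frame_energy=[0.5, 1.5, 1.0, 1.0, 1.0],
)
```

A voice-conversion system that only swaps the speaker identity can skip this step, which would explain why it needs no transcription.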

Flux9665, Mar 08 '23 15:03

Ah, that makes sense for voice privacy purposes. But what would be the application of the biblical ensemble? Is it just for fun?

Ca-ressemble-a-du-fake, Mar 08 '23 20:03

Yes, the ensemble reading is just for fun. We thought it might be useful for fooling a speaker verification system, but we had no success with that.

Flux9665, Apr 13 '23 11:04