tortoise-tts
Multiple speakers defined in input text
Is it possible to define multiple speakers for different portions of the input text that you feed to read.py?
Maybe via SSML syntax or, though I'm dreaming here, with natural language inside brackets (e.g., [Tom speaks:])?
As far as I know this isn't an existing feature and there are no plans to implement SSML, but what you're describing is fairly straightforward to achieve. Pre-process the text to match each speaker's utterances to their loaded voice, generate the speech for each utterance independently, then combine the clips in the original order (as in tortoise/read.py).
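To make the pre-processing step concrete, here is a rough sketch of how the text could be split into per-speaker segments. The `[Name speaks:]` tag format comes from the question; `split_by_speaker`, `render`, the `narrator` default, and the `synthesize` callback are hypothetical names for illustration and are not part of tortoise.

```python
import re
from typing import Callable, List, Tuple

# Tag format borrowed from the question, e.g. "[Tom speaks:] Hello there."
SPEAKER_TAG = re.compile(r"\[(\w+) speaks:\]")


def split_by_speaker(text: str, default_speaker: str = "narrator") -> List[Tuple[str, str]]:
    """Return (speaker, utterance) pairs in the order they appear in the text."""
    segments: List[Tuple[str, str]] = []
    speaker = default_speaker  # text before the first tag goes to this voice
    pos = 0
    for match in SPEAKER_TAG.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append((speaker, chunk))
        speaker = match.group(1).lower()  # assumes voice names are lowercase
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((speaker, tail))
    return segments


def render(text: str, synthesize: Callable[[str, str], object]) -> list:
    """Generate one clip per segment, in order.

    `synthesize(speaker, utterance)` is a hypothetical callback that wraps
    whatever TTS call you use and returns the audio for that utterance.
    """
    return [synthesize(speaker, utterance)
            for speaker, utterance in split_by_speaker(text)]
```

With tortoise, `synthesize` would presumably load the named voice and call the TTS model for that utterance, and the returned clips could then be concatenated along the time axis, much as tortoise/read.py does for its text chunks. The tag names would need to match the voice names you have loaded.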
That being said, since the prompt affects the voice, there will be a lot of variation for the same speaker, which makes tortoise nearly impossible to work with for long inputs. You can change the params to make the voice more consistent, but then it also becomes bland.