TTS [Bug] The voice-cloned speaker continues with garbage after to-be-spoken text was finished or mid-sentence

Describe the bug

Sometimes the speech pauses and then the speaker continues, but what follows is neither part of the written text nor any recognizable language, although it is clearly the same speaker. This happens either after the to-be-spoken text has finished or mid-sentence. Unless you want to create a horror movie with a disturbingly familiar voice, this behaviour is undesired. I think Bark has the same issue.

To Reproduce

import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "tts_models/multilingual/multi-dataset/xtts_v2"
tts = TTS(model_name=model_name).to(device)
tts.tts_to_file(text="Some longer text", speaker_wav="some.wav", language="de", file_path="some-output.wav")

Expected behavior

Only the written text should be spoken.

Feb 11 '24 14:02 Bardo-Konrad

Does anyone have a workaround for this?

I tried ending all my text with a period ".", but that does not make the synthesizer stop at the end of the text. Often the output contains artifacts in addition to the input text.

Feb 29 '24 22:02 kaveenkumar

Does anyone have a workaround for this?

I tried ending all my text with a period ".", but that does not make the synthesizer stop at the end of the text. Often the output contains artifacts in addition to the input text.

Probably the only way around it is to generate the speech, run speech-to-text on it, compare the transcript to the input to get the timestamps of the gibberish, cut those segments out, and save the file again.

Kinda dumb, but what the heck.
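A minimal sketch of the comparison step, in case it helps anyone trying this. It assumes you already have word-level timestamps from some speech-to-text tool as (word, start_sec, end_sec) tuples; that tuple format and the function names are made up for illustration and are not part of TTS. It uses difflib from the standard library to align the transcript with the input and reports the time spans of words that were heard but never written.

```python
import difflib
import re


def normalize_word(word):
    """Lowercase a word and drop punctuation so "World." matches "world"."""
    return re.sub(r"[^\w]", "", word.lower())


def find_unwritten_spans(input_text, transcript_words):
    """Return (start_sec, end_sec) spans of transcribed words missing from the input.

    transcript_words is a list of (word, start_sec, end_sec) tuples. This is a
    hypothetical format; adapt it to whatever your speech-to-text tool returns.
    """
    expected = [normalize_word(w) for w in input_text.split()]
    heard = [normalize_word(w) for w, _, _ in transcript_words]
    matcher = difflib.SequenceMatcher(a=expected, b=heard, autojunk=False)
    spans = []
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        # "insert" means words were heard that have no counterpart in the input.
        # "replace" is ignored here because it may just be a transcription error.
        if tag == "insert":
            spans.append((transcript_words[j1][1], transcript_words[j2 - 1][2]))
    return spans


words = [("Hello", 0.0, 0.4), ("world", 0.5, 0.9), ("blarg", 1.5, 2.0), ("fnord", 2.1, 2.6)]
print(find_unwritten_spans("Hello, world.", words))
```

The returned spans are only candidates for removal. Gibberish that the recognizer drops entirely, or turns into something resembling a real word from the input, will not be caught this way.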

Mar 01 '24 22:03 Bardo-Konrad

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

Apr 22 '24 05:04 stale[bot]

I want to draw attention to this.

Jun 29 '24 09:06 Bardo-Konrad

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

Aug 02 '24 02:08 stale[bot]

Does anyone have a workaround for this? I tried ending all my text with a period ".", but that does not make the synthesizer stop at the end of the text. Often the output contains artifacts in addition to the input text.

Probably the only way around it is to generate the speech, run speech-to-text on it, compare the transcript to the input to get the timestamps of the gibberish, cut those segments out, and save the file again.

Kinda dumb, but what the heck.

I am thinking of implementing this. However, instead of gathering timestamps for the gibberish (which we don't know in advance, so it is complex to detect), I would prefer to gather timestamps for the input text (which we do know) and crop and save only that span.
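A rough sketch of what I have in mind, under a few assumptions: the word timestamps come from some speech-to-text tool as (word, start_sec, end_sec) tuples (a hypothetical format), the generated file is plain PCM WAV that the standard wave module can read, and the helper names and the padding value are my own. It finds the first and last transcribed words that align with the input text and keeps only that part of the audio.

```python
import difflib
import re
import wave


def _norm(word):
    """Lowercase a word and drop punctuation before comparing."""
    return re.sub(r"[^\w]", "", word.lower())


def spoken_span(input_text, transcript_words, padding=0.1):
    """Return (start_sec, end_sec) covering the words that match the input, or None.

    transcript_words: list of (word, start_sec, end_sec) tuples (assumed format).
    padding: extra seconds kept on both sides so words are not clipped.
    """
    expected = [_norm(w) for w in input_text.split()]
    heard = [_norm(w) for w, _, _ in transcript_words]
    matcher = difflib.SequenceMatcher(a=expected, b=heard, autojunk=False)
    # The final matching block is always a zero-size sentinel, so filter it out.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None
    first = blocks[0].b
    last = blocks[-1].b + blocks[-1].size - 1
    start = max(0.0, transcript_words[first][1] - padding)
    end = transcript_words[last][2] + padding
    return start, end


def crop_wav(src_path, dst_path, start_sec, end_sec):
    """Write only the [start_sec, end_sec] span of a PCM WAV file to dst_path."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        first = min(total, max(0, int(start_sec * rate)))
        last = min(total, max(first, int(end_sec * rate)))
        src.setpos(first)
        frames = src.readframes(last - first)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is corrected on close
        dst.writeframes(frames)
```

Usage would be something like calling spoken_span(text, words) and, if it returns a span, passing that span to crop_wav("some-output.wav", "some-output-cropped.wav", start, end). This only trims garbage before the first and after the last matched word; mid-sentence garbage would still need to be cut out separately.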

Aug 08 '24 13:08 kaveenkumar
