[Bug] The voice-cloned speaker continues with garbage after to-be-spoken text was finished or mid-sentence

Open Bardo-Konrad opened this issue 1 year ago • 7 comments

Describe the bug

Sometimes the speech pauses, then the speaker continues with audio that is neither part of the input text nor any recognizable language, although it is clearly the same speaker. Unless you want to create a horror movie with a disturbingly familiar voice, this behaviour is undesired. I think Bark has the same issue.

To Reproduce

import torch
from TTS.api import TTS

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
was = 'tts_models/multilingual/multi-dataset/xtts_v2'
tts = TTS(model_name=was).to(device)
tts.tts_to_file(text="Some longer text", speaker_wav="some.wav", language="de", file_path="some-output.wav")

Expected behavior

Only speak what's being written.

Bardo-Konrad avatar Feb 11 '24 14:02 Bardo-Konrad

Does anyone have a workaround for this?

I tried ending all my text with a period ".", but that does not make the synthesizer stop at the end of the text. Often there are artifacts along with the input text.

kaveenkumar avatar Feb 29 '24 22:02 kaveenkumar

Probably the only way around it is to generate the speech, run speech-to-text on the result, compare the transcript to the input to get the timestamps of the gibberish, cut those parts out, and save the file again.

Kinda dumb, but what the heck.
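The comparison step described above could be sketched roughly as follows. This is only an illustration, not something from the TTS library: it assumes some speech-to-text tool has already produced word-level timestamps as `(word, start_sec, end_sec)` tuples, and the helper names `normalize` and `gibberish_spans` are made up for this example. Note that real words the recognizer mis-transcribes would be flagged as well, so the spans are candidates rather than a verdict.

```python
import difflib
import re


def normalize(word):
    # Lowercase and strip punctuation so "Hello," matches "hello".
    return re.sub(r"[^\w]", "", word.lower())


def gibberish_spans(input_text, transcript_words):
    """Return (start, end) times of transcribed words with no match in the input.

    transcript_words is assumed to be a list of (word, start_sec, end_sec)
    tuples from a speech-to-text tool with word-level timestamps.
    """
    expected = [w for w in (normalize(x) for x in input_text.split()) if w]
    heard = [normalize(w) for w, _, _ in transcript_words]
    matcher = difflib.SequenceMatcher(a=expected, b=heard, autojunk=False)
    spans = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" cover heard words that the input does not contain.
        if tag in ("insert", "replace") and j2 > j1:
            spans.append((transcript_words[j1][1], transcript_words[j2 - 1][2]))
    return spans
```

The returned spans could then be cut out of the audio with any audio editing library before saving the file again.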

Bardo-Konrad avatar Mar 01 '24 22:03 Bardo-Konrad

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

stale[bot] avatar Apr 22 '24 05:04 stale[bot]

I want to draw attention to this.

Bardo-Konrad avatar Jun 29 '24 09:06 Bardo-Konrad

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

stale[bot] avatar Aug 02 '24 02:08 stale[bot]

I am thinking of implementing the speech-to-text approach suggested above. However, instead of gathering timestamps for the gibberish (which is unknown and therefore complex to detect), I would prefer to gather timestamps for the input text (which is known), then crop the audio to that span and save only that.
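A minimal sketch of that cropping idea, under several assumptions: the word timestamps come from some speech-to-text tool as `(word, start_sec, end_sec)` tuples, the output is a PCM WAV file that the standard `wave` module can read, and the names `spoken_span` and `crop_wav` are hypothetical helpers invented for this example. It only trims leading and trailing extras; gibberish in the middle of a sentence would survive, and a gibberish word that happens to be transcribed as a word from the input could extend the span.

```python
import difflib
import re
import wave


def normalize(word):
    # Lowercase and strip punctuation so "Hello," matches "hello".
    return re.sub(r"[^\w]", "", word.lower())


def spoken_span(input_text, transcript_words):
    """Return (start, end) seconds spanning the transcribed words that match the input."""
    expected = [w for w in (normalize(x) for x in input_text.split()) if w]
    heard = [normalize(w) for w, _, _ in transcript_words]
    matcher = difflib.SequenceMatcher(a=expected, b=heard, autojunk=False)
    # The final matching block is always a zero-length sentinel, so filter it out.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None
    first, last = blocks[0], blocks[-1]
    start = transcript_words[first.b][1]
    end = transcript_words[last.b + last.size - 1][2]
    return start, end


def crop_wav(src_path, dst_path, start, end, padding=0.1):
    """Copy only the frames between start and end (plus padding) to a new WAV file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        first = min(max(0, int((start - padding) * rate)), total)
        last = max(first, min(total, int((end + padding) * rate)))
        src.setpos(first)
        frames = src.readframes(last - first)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames)
```

Usage would be along the lines of calling `spoken_span(text, words)` on the transcript of `some-output.wav` and, if it returns a span, passing its start and end to `crop_wav`. The small `padding` is there so the last word is not clipped if the recognizer's timestamps are slightly early.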

kaveenkumar avatar Aug 08 '24 13:08 kaveenkumar