piper [Bug] Long text via cat makes the TTS engine mumble after a while

Tested on 64-bit Windows 10 with the Piper 2023.9.9-1 prerelease. A Windows port of cat was used.

When a text has around 6000 characters, piper starts mumbling after reading for a while. The command used was: $ cat demo.txt | piper -m model.onnx -f demo.wav

The model used was de_DE_thorsten_high.onnx, though I guess this can be reproduced with other models as well.

Sep 17 '23 09:09 domasofan

Hi,

Coqui TTS seems to have a similar problem: https://github.com/coqui-ai/TTS/issues/2959

After speaking about 3 minutes of text, the voice starts to stutter or make other noises. Maybe it is a general problem with neural voices, though I don't know what causes it. Some say the neural system may need to be reset more often, maybe after every sentence. The sound Coqui TTS makes is like someone having an attack or something. Piper starts to mumble and change its speech tempo. Maybe someone could look into it?
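One way to test the "reset after every sentence" idea is to run a separate piper process for each sentence and join the resulting WAV files afterwards. This is only a minimal Python sketch, assuming piper is on the PATH. The function names, the regex-based sentence split and the part_NNNN.wav file naming are my own illustrative choices, not anything Piper provides. Only the -m and -f flags from the command above are used.

```python
import re
import subprocess
import wave
from pathlib import Path


def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace.

    This is an assumption for the experiment, not Piper's own segmentation;
    abbreviations such as "z. B." will be split in the wrong place.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_per_sentence(text, model, out_dir):
    """Run one piper process per sentence so nothing carries over between them."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, sentence in enumerate(split_sentences(text)):
        wav_path = out_dir / f"part_{i:04d}.wav"
        subprocess.run(
            ["piper", "-m", str(model), "-f", str(wav_path)],
            input=sentence.encode("utf-8"),
            check=True,
        )
        paths.append(wav_path)
    return paths


def concat_wavs(paths, out_path):
    """Join WAV files that share sample rate, sample width and channel count."""
    with wave.open(str(out_path), "wb") as out:
        for i, path in enumerate(paths):
            with wave.open(str(path), "rb") as part:
                if i == 0:
                    # Take the audio format from the first part.
                    out.setparams(part.getparams())
                out.writeframes(part.readframes(part.getnframes()))
```

If the joined file no longer mumbles at the same places, that would point towards state building up within a single long run. Starting a process per sentence reloads the model every time, so this is slow and only meant as a diagnostic.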

This bug may be related as well: https://github.com/rhasspy/piper/issues/212

Greetings, Simon

Oct 03 '23 18:10 domasofan

Just tested with the latest release, 2023.11.14-2, using a long file (23:29 minutes). The problem has been partly fixed, or has partly fixed itself. Mumbling/slurring now happens randomly from time to time, the first time after more than 2 minutes of speech. With the test file it happened at the following times: 2:10-2:40, 11:59-12:14, 14:33-15:12.
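To make the affected ranges easier to compare, they can be cut out of the output into separate files. A small Python sketch using only the standard wave module follows. It assumes the audio is available as a WAV file (for example written with -f, or converted from the MP3 first). The helper names are made up for illustration.

```python
import wave


def to_seconds(stamp):
    """Convert 'm:ss' or 'h:mm:ss' to whole seconds."""
    seconds = 0
    for field in stamp.split(":"):
        seconds = seconds * 60 + int(field)
    return seconds


def cut_range(src_path, dst_path, start, end):
    """Copy the audio between two time stamps into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        first = to_seconds(start) * rate
        count = (to_seconds(end) - to_seconds(start)) * rate
        # Clamp the start so a stamp past the end yields an empty clip.
        src.setpos(min(first, src.getnframes()))
        frames = src.readframes(count)
        with wave.open(dst_path, "wb") as dst:
            dst.setparams(src.getparams())
            dst.writeframes(frames)


# Ranges reported above for the 23:29 test file.
REPORTED_RANGES = [("2:10", "2:40"), ("11:59", "12:14"), ("14:33", "15:12")]
```

Calling cut_range("demo.wav", "clip.wav", start, end) for each pair in REPORTED_RANGES would give three short clips that can be attached to the issue or compared between releases (demo.wav and clip.wav are placeholder names).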

The demo MP3 can be found here: https://cloud.hohenems.at/index.php/s/867FWLXRmwwQzCA

The demo text used is here: https://cloud.hohenems.at/index.php/s/ePpj7bDnzT7J58a

Each time, the voice seems to recover after 20-30 seconds.

Tested with the de_DE-thorsten-high voice.

Nov 15 '23 07:11 domasofan

I am running repeated invocations with de_DE-thorsten-high, and about one time in five it speaks at double speed. It is not exactly mumbling, but everything is choppy and the average words per minute doubles.

This is not only triggered by longer texts: it occurs with three-sentence texts as well.

I found the issue with other voices, just not as frequently as with Thorsten.
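If the same text really comes out at double speed, the resulting WAV file should be roughly half as long, so comparing the durations of repeated runs would put a number on "one in five". A hedged Python sketch follows. The 0.75 threshold and the function names are arbitrary assumptions, and it expects the runs to have been written to WAV files beforehand.

```python
import statistics
import wave


def wav_duration(path):
    """Length of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def flag_fast_runs(durations, ratio=0.75):
    """Return the indices of runs noticeably shorter than the median run."""
    median = statistics.median(durations)
    return [i for i, d in enumerate(durations) if d < ratio * median]
```

For example, flag_fast_runs([wav_duration(p) for p in paths]) over twenty runs of the same sentence would list the runs that came out too fast. This only works if fewer than half of the runs are affected, since the median is used as the reference.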

May 06 '24 19:05 clort81

I guess this is not caused by the cat input. It seems to result from piping the raw output into ffmpeg or lame. If I just let piper write WAV files directly, I don't get any errors. de_DE-mls-medium seems to be affected most. This is a multi-speaker model; I used speaker 1. Tested under Windows. I can't currently test on Linux.
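To separate piper's raw output from the ffmpeg/lame step, the raw stream could be saved to a file and wrapped in a WAV header without any encoder involved. The Python sketch below assumes 16-bit mono PCM at 22050 Hz, which is what the ffplay commands in this thread use for these voices. The function name and default parameters are assumptions for illustration only.

```python
import wave


def raw_to_wav(raw_bytes, wav_path, rate=22050, channels=1, sample_width=2):
    """Wrap headerless PCM in a WAV container and return the frame count."""
    frame_size = channels * sample_width
    if len(raw_bytes) % frame_size:
        # A partial frame hints that bytes were added or lost on the way.
        raise ValueError("byte count is not a whole number of frames")
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(rate)
        w.writeframes(raw_bytes)
    return len(raw_bytes) // frame_size
```

If the WAV produced this way sounds fine while the ffmpeg or lame result does not, the pipe or the encoder settings would be the more likely culprit. If it already sounds wrong, the raw output itself would be the thing to look at.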

May 08 '24 19:05 domasofan

Using piper-tts 1.2.0 (installed using pip3), the de_DE-mls-medium model produces unusable, outlandish-sounding output in 90% of cases for me. Interestingly enough, the "Rainbow" test sentence from the samples web page always works.

Here is an example output of de_DE-mls-medium, speaker 3, speaking the sentences:

Dies ist ein Test.
Der Regenbogen ist ein atmosphärisch-optisches Phänomen, das als kreisbogenförmiges farbiges Lichtband in einer von der Sonne beschienenen Regenwand oder -wolke wahrgenommen wird.

de_DE-mls-medium.Testfiles.zip

I used Linux Mint 21.3 and this command

echo "Dies ist ein Test." | piper -m /home/matthias/.local/share/piper/voices/de_DE-mls-medium.onnx -s 3 --length-scale 1 -f /tmp/jingle.wav

to produce these files (my voices are in ~/.local/share/piper/voices/). The result is the same without --length-scale, and other speakers produce the same gibberish.

Jun 01 '24 05:06 Moonbase59

I don't speak German but can understand some words, so I did a few tests:

Test 1: echo "oktoberfest brezel eins zwei polizei guten tag" | piper -m de_DE-mls-medium.onnx -s 3 --length-scale 1 --output-raw | ffplay -f s16le -ar 22050 -i pipe: -autoexit

Test 2: echo "oktoberfest brezel eins zwei polizei" | piper -m de_DE-mls-medium.onnx -s 3 --length-scale 1 --output-raw | ffplay -f s16le -ar 22050 -i pipe: -autoexit

I ran the tests with randomly picked speakers. de_DE-mls-medium needs a minimum of a few words to work correctly: Test 1 sounds fine while Test 2 sounds weird, and shorter inputs are even worse.

I tested another model, de_DE-thorsten-medium, and it works fine even with only one word.
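To find the word count at which a voice starts to break, the same phrase could be synthesized repeatedly with one word fewer each time. This is a rough Python sketch that assumes piper is on the PATH. The helper names and the output file naming are hypothetical, and only flags that already appear in this thread (-m, -s, -f) are used.

```python
import subprocess


def word_prefixes(text, minimum=1):
    """Return progressively shorter inputs, dropping one trailing word each time."""
    words = text.split()
    return [" ".join(words[:n]) for n in range(len(words), minimum - 1, -1)]


def piper_command(model, speaker, wav_path):
    """Build a piper command line from flags seen in this thread."""
    return ["piper", "-m", model, "-s", str(speaker), "-f", wav_path]


def run_sweep(text, model, speaker, out_prefix):
    """Write one WAV file per prefix, named after its word count."""
    for phrase in word_prefixes(text):
        count = len(phrase.split())
        wav_path = f"{out_prefix}_{count}_words.wav"
        subprocess.run(
            piper_command(model, speaker, wav_path),
            input=phrase.encode("utf-8"),
            check=True,
        )
```

Running run_sweep("oktoberfest brezel eins zwei polizei guten tag", "de_DE-mls-medium.onnx", 3, "sweep") would produce seven files, from seven words down to one, which can then be listened to in order.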

Jun 01 '24 13:06 TikoTako

Yeah, it seems the multi-speaker (most versatile) voices don’t work correctly. de_DE-thorsten is always good.

Jun 12 '24 03:06 Moonbase59