[Bug] Long text via cat makes the tts engine mumble after a while
Tested on 64-bit Windows 10 with the Piper 2023.9.9-1 prerelease, using a Windows port of cat.
With a text of around 6000 characters, Piper starts mumbling after reading for a while. The command used was: $ cat demo.txt | piper -m model.onnx -f demo.wav
The model used was de_DE_thorsten_high.onnx, though I guess this can be reproduced with other models as well.
Hi,
Coqui TTS seems to have a similar problem: https://github.com/coqui-ai/TTS/issues/2959
After speaking about 3 minutes of text, the voice starts to stutter or make other noises. Maybe it is a problem with neural voices in general, though I don't know what causes it. Some say the neural system may need to be reset more often, maybe after every sentence. The sound Coqui TTS makes is like someone having an attack or something. Piper seems to start mumbling and changing its speech tempo. Maybe someone could look into it?
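The per-sentence reset idea can be tried without changing Piper itself, by splitting the text and synthesizing each piece in a separate run. The following is a minimal Python sketch, not a confirmed fix: `split_sentences`, `chunk_sentences` and `synthesize_chunks` are hypothetical helper names, the regex splitter is deliberately naive, the `chunk_NNNN.wav` file names are made up, and the only Piper flags used are the `-m` and `-f` flags from the command above.

```python
import os
import re
import subprocess


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, max_chars=500):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def synthesize_chunks(chunks, model_path, out_dir):
    """Run one piper process per chunk (assumes `piper` is on PATH)."""
    os.makedirs(out_dir, exist_ok=True)
    for index, chunk in enumerate(chunks):
        out_path = os.path.join(out_dir, f"chunk_{index:04d}.wav")
        subprocess.run(
            ["piper", "-m", model_path, "-f", out_path],
            input=chunk.encode("utf-8"),
            check=True,
        )
```

If the mumbling disappears when every chunk gets a fresh process, that would support the reset theory; if it still shows up inside short chunks, the cause is probably elsewhere.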
Maybe this bug is related to it as well: https://github.com/rhasspy/piper/issues/212
Greetings, Simon
Just tested with the latest release, 2023.11.14-2, using a long file (23:29 minutes of audio). The problem is partly fixed, or has partly fixed itself: mumbling/slurring now happens only from time to time, at random, and for the first time after more than 2 minutes of speech. With the test file it occurred at 2:10-2:40, 11:59-12:14, and 14:33-15:12.
The demo MP3 can be found here: https://cloud.hohenems.at/index.php/s/867FWLXRmwwQzCA
The demo text used is here: https://cloud.hohenems.at/index.php/s/ePpj7bDnzT7J58a
It seems to recover after 20-30 seconds.
Tested with the de_DE-thorsten-high voice.
I am running repeated invocations with de_DE-thorsten-high, and about one in 5 times it speaks at double speed. It is not exactly mumbling, but everything is choppy and the average words per minute doubles.
This is not only triggered by longer texts: it occurs with three-sentence texts as well.
I found the issue with other voices, just not as frequently as with Thorsten.
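One way to put a number on the "one in 5" rate is to synthesize the same text repeatedly into WAV files and compare their durations, since a run at double speed should come out roughly half as long. A minimal sketch using only the Python standard library; `wav_duration_seconds` and `flag_short_runs` are hypothetical helper names, and the 25% tolerance is an arbitrary assumption.

```python
import statistics
import wave


def wav_duration_seconds(path):
    """Return the playback length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def flag_short_runs(durations, tolerance=0.25):
    """Return indices of runs clearly shorter than the median duration.

    A run is flagged when it is more than `tolerance` (as a fraction)
    below the median of all runs.
    """
    median = statistics.median(durations)
    threshold = median * (1 - tolerance)
    return [i for i, d in enumerate(durations) if d < threshold]
```

This only works if most runs are normal, because the median is used as the reference length.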
I guess this is not caused by the cat input after all. It seems to result from piping the raw output into ffmpeg or lame. If I just let Piper write WAV files directly, I don't get any errors. de_DE-mls-medium seems to be affected most. This is a multi-speaker model; I used speaker 1. Tested under Windows. I can't currently test on Linux.
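To narrow down whether the pipe itself damages the audio, the raw stream can be captured to a file and wrapped in a WAV header, then compared by ear with the WAV that Piper writes directly. A minimal sketch, assuming the raw output is 16-bit signed little-endian mono PCM at 22050 Hz (the actual sample rate depends on the voice and should be checked); `raw_to_wav` and `looks_misaligned` are hypothetical helper names.

```python
import wave


def raw_to_wav(raw_bytes, wav_path, sample_rate=22050, channels=1):
    """Wrap raw 16-bit PCM bytes in a WAV container.

    sample_rate and channels are assumptions; adjust them to the voice.
    """
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(raw_bytes)


def looks_misaligned(raw_bytes, channels=1):
    """True if the byte count is not a whole number of 16-bit frames."""
    return len(raw_bytes) % (2 * channels) != 0
```

A byte count that is not a multiple of the frame size would hint that bytes were added or dropped somewhere between Piper and the encoder. That is only one possible cause, and an aligned stream does not prove the pipe is innocent.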
Using piper-tts 1.2.0 (installed using pip3), the de_DE-mls-medium model produces unusable, outlandish sounding output in 90% of cases for me. Interestingly enough, the "Rainbow" test sentence from the samples web page always works.
Here is an example output of de_DE-mls-medium, speaker 3, speaking the sentences:
- Dies ist ein Test.
- Der Regenbogen ist ein atmosphärisch-optisches Phänomen, das als kreisbogenförmiges farbiges Lichtband in einer von der Sonne beschienenen Regenwand oder -wolke wahrgenommen wird.
de_DE-mls-medium.Testfiles.zip
Used Linux Mint 21.3 and this command
echo "Dies ist ein Test." | piper -m /home/matthias/.local/share/piper/voices/de_DE-mls-medium.onnx -s 3 --length-scale 1 -f /tmp/jingle.wav
to produce these (I have my voices in ~/.local/share/piper/voices/). Same result when not using --length-scale. Same result (gibberish) when using other speakers.
I don't speak German, but I can understand some words, so I ran a few tests:
Test 1:
echo "oktoberfest brezel eins zwei polizei guten tag" | piper -m de_DE-mls-medium.onnx -s 3 --length-scale 1 --output-raw | ffplay -f s16le -ar 22050 -i pipe: -autoexit
Test 2:
echo "oktoberfest brezel eins zwei polizei" | piper -m de_DE-mls-medium.onnx -s 3 --length-scale 1 --output-raw | ffplay -f s16le -ar 22050 -i pipe: -autoexit
I ran the tests with randomly picked speakers. de_DE-mls-medium needs a minimum of a few words to work correctly: Test 1 sounds fine while Test 2 sounds weird, and with even fewer words it gets worse.
I tested another model, de_DE-thorsten-medium, and it works fine even with only one word.
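To find the word count at which a voice starts to break, the same phrase can be shortened one word at a time and each version fed to Piper. A minimal sketch that only generates the inputs; `shrinking_inputs` is a hypothetical helper name, and running Piper on each line and judging the result by ear is left to the tester.

```python
def shrinking_inputs(phrase):
    """Return the phrase, then each version with one more trailing word removed."""
    words = phrase.split()
    return [" ".join(words[:count]) for count in range(len(words), 0, -1)]


if __name__ == "__main__":
    # Print one test input per line, longest first.
    for text in shrinking_inputs("oktoberfest brezel eins zwei polizei guten tag"):
        print(text)
```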
Yeah, it seems the multi-speaker (most versatile) voices don’t work correctly. de_DE-thorsten is always good.