Issue with Mumbling Voice at the beginning of the French TTS Output
I have encountered an issue while using the text-to-speech (TTS) functionality for the French language in the project. The generated output speech exhibits a mumbling voice at the beginning, which affects the overall audio quality. Interestingly, this problem does not occur when using English, Turkish, or German.
I tested the models at "./piper-voices/tree/v1.0.0/fr/fr_FR/gilles/low/fr_FR-gilles-low.onnx" and "./piper-voices/tree/main/fr/fr_FR/mls_1840/low/fr_FR-mls_1840-low.onnx" from the piper-voices repository, and both produce inaudible speech at the beginning of the generated output file.
How to Reproduce the Issue:
- Download "piper-voices" and save it to the folder of your installed project.
- Open the command prompt (cmd) and navigate to the directory where your Piper project is installed.
- Run the following command:
echo 'Bienvenue dans le monde de la synthèse vocale !' | piper --model ./piper-voices/fr/fr_FR/gilles/low/fr_FR-gilles-low.onnx --output_file test.wav
FWIW, the French voices siwis and upmc, both medium quality, rendered your sample text correctly here on Linux. Maybe this is a voice quality issue that is only present in the low-quality version?
Hi @colbec. Thank you for your prompt response. You are right. The issue revolves around the subpar quality of the low-tier models, prompting me to switch to the medium-tier models for the French language.
I started to play with piper recently and I experience mumbling in a high-quality English model as well (en_US-ryan-high).
Progress in science can only be made when an experimenter reports sufficient detail for others to be able to repeat the experiment and provide feedback. Please provide a short sample of text which reliably produces the mumbling you describe.
Alright, it happens often on large texts. It took me some time to reproduce it on a smaller text, but here it is: https://pastebin.pl/view/c080cc34
I reproduced it on an Ubuntu machine at a cloud provider and on one running in WSL on Windows.
Okay, I took your file and ran it unedited through piper on my local machine, no cloud involved.
I generated a wav file by combining all the separate sentences, separated by short silences, into one large wav file. I played the result, rendered with the ryan-high voice, using both sox play and aplay, and both played without error or mumbling. This was at normal speed and without any edits to the text provided. I have a Julia script that sets up piper to generate the wav output, but it is piper that does all the real work.
If I can provide you with other details that would help you to reproduce my approach let me know.
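For anyone who wants to try the same "one wav per sentence, joined with short silences" approach without the Julia script, here is a minimal sketch in Python. It is not the script used above; the function name `concat_wavs` and the 0.3 second gap are assumptions for illustration, and it assumes all inputs are 16-bit PCM wav files with the same sample rate and channel count.

```python
import wave


def concat_wavs(paths, out_path, gap_seconds=0.3):
    """Join WAV files that share one format, inserting silence between them.

    Hypothetical helper for illustration; assumes signed 16-bit PCM input,
    where zero bytes represent silence.
    """
    params = None
    chunks = []
    for path in paths:
        with wave.open(path, "rb") as w:
            fmt = (w.getnchannels(), w.getsampwidth(), w.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path}: format {fmt} differs from {params}")
            chunks.append(w.readframes(w.getnframes()))
    if params is None:
        raise ValueError("no input files given")

    channels, width, rate = params
    # Number of silent frames times bytes per frame.
    silence = b"\x00" * (int(rate * gap_seconds) * channels * width)

    with wave.open(out_path, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(width)
        out.setframerate(rate)
        out.writeframes(silence.join(chunks))
    return out_path
```

Usage would be something like `concat_wavs(["s000.wav", "s001.wav"], "combined.wav")`, where the input files are whatever per-sentence files you generated.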
Well, I'm not sure what the difference is. I ran it with a pretty simple command:
cat small-text.txt | ./piper/piper --model en_US-ryan-high.onnx --output_file output.wav
I have uploaded a wav file to Google drive. There should be very little difference from yours?
Well mine has the issue that I am talking about. Here it is: https://drive.google.com/file/d/1t7r4PwXKUg57ucIYDjIvA94a_cfvfS0F/view?usp=drive_link
Yes, the garbling is definitely there. I hear the first few sentences play ok, then a few sentences play at super speed, then it slows to normal again for the last sentences. It does not seem to be related to any punctuation.
Here's a hypothesis: maybe threads or processes get muddled and output arrives at the wrong time. Alternatively, the data is generated correctly but is given the wrong time frame, so then it tries to stuff a quart into a pint bottle.
The weird thing is, I have this issue on two really different computers and environments. I don't know what this Julia script that you mentioned does, but is it possible that it runs piper with different options than my basic command?
My Julia script is designed to work with text input formatted in a special way, so it would not be of any help here. Since the text file consists of only one line containing multiple sentences, it would default to reading it as one single string.
One difference is that your command uses a cat approach to read a file, mine splits up sentences into separate lines and uses echo approach to read a string.
It may be relevant to specify that I do not have a GPU so no calculations are offloaded from the CPU.
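To make the difference between the two approaches concrete, here is a rough Python sketch of the per-sentence variant: split the text into sentences and pipe each one to piper separately, instead of sending the whole file in one go with cat. This is only an approximation of what was described, not the Julia script. The helper names, the naive regex splitter, and the output file naming are assumptions; the only piper options used are `--model` and `--output_file`, as in the commands earlier in this thread.

```python
import re
import subprocess


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def piper_command(model, output_file, piper_bin="piper"):
    """Build the same command line used above, as an argument list."""
    return [piper_bin, "--model", model, "--output_file", output_file]


def synthesize_per_sentence(text, model, prefix="sentence", piper_bin="piper"):
    """Run piper once per sentence, feeding each sentence on stdin.

    Returns the list of wav files written (names are hypothetical).
    """
    outputs = []
    for i, sentence in enumerate(split_sentences(text)):
        out = f"{prefix}_{i:03d}.wav"
        subprocess.run(
            piper_command(model, out, piper_bin),
            input=sentence + "\n",
            encoding="utf-8",
            check=True,
        )
        outputs.append(out)
    return outputs
```

If the garbling only shows up when the whole text is sent as one string, comparing the output of `synthesize_per_sentence(open("small-text.txt", encoding="utf-8").read(), "en_US-ryan-high.onnx", piper_bin="./piper/piper")` against the single cat run might help narrow it down.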
See also: https://github.com/rhasspy/piper/issues/211#issuecomment-2143309023 (German mls model producing gibberish)