piper
Run in server mode?
TLDR: Is there a way to keep the process running long-term, and "instantly" generate output whenever it receives text input?
I've been following the Rhasspy project for a long time, and I'm digging the Piper project. Great work! I'm integrating it into Voco, a voice control plugin for the Candle Controller (and Webthings Gateway).
One thing I'm running into is that whenever I generate speech, the model takes a second to load. This is a precious second.
To combat this I tried implementing JSON input, under the (false) assumption that it would let me pipe text into Piper, running as a Python subprocess, on demand. I was hoping the model would stay loaded that way, so that audio generation could start as soon as possible.
Unfortunately, after I pipe text into the process it generates the audio and then stops. I then have to restart Piper, which, when multiple sentences need to be spoken in a row, adds roughly a second of delay between each sentence.
Is there a way to keep the process running long-term, and "instantly" generate output whenever it receives text input?
This would have some other small advantages too.
- Instead of checking whether there is enough memory to run Piper beforehand, which the code does now, the memory would only really need to be allocated once. This would make it more predictable/stable when my plugin uses the nice Piper voice, or when it has to fall back to nanoTTS for lack of free memory. It would also make it easier for users to predict whether they have enough free memory to install other plugins.
- Voco has two other model-based parts that already operate in such a 'server mode': Whisper for speech recognition, and Llamafile for the actual local chat assistant. Having all three processes run in such a server mode would make it attractive to code a single system for managing these long-running processes (and restarting them if they crash, for example); a rough sketch of what I mean follows below.
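To illustrate that last point, here is a rough, hypothetical sketch of such a manager (the service names and command lines are placeholders, not existing Voco code): each process is started once and restarted whenever it exits.
import subprocess
import time

# Placeholder commands: substitute the real piper/whisper/llamafile invocations.
SERVICES = {
    'piper': ['./piper', '--model', 'en_US-danny-low.onnx', '--output-raw'],
    # 'whisper': [...],    # speech-to-text server
    # 'llamafile': [...],  # local chat assistant server
}

processes = {}

while True:
    for name, command in SERVICES.items():
        proc = processes.get(name)
        # Start the process if it has never run, or restart it if it crashed.
        if proc is None or proc.poll() is not None:
            processes[name] = subprocess.Popen(command)
    time.sleep(5)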
Even faster
A related question: the LLM assistant generates its output word by word. Currently I wait for a full sentence to be complete before I send it to Piper. Since (on a Raspberry Pi 5) the assistant generates text faster than Piper can speak it, would it be possible to have a mode where Piper already starts generating speech once it has a buffer of just, say, 3 words? That might shave another second off the response time.
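To make the idea concrete, here is a minimal, hypothetical sketch of the client-side buffering I have in mind (generate_words() stands in for the assistant's word stream and speak() for whatever hands text to Piper; neither is an existing Piper or Voco function):
def chunk_words(words, min_words=3):
    # Group a stream of words into chunks of at least `min_words`,
    # flushing early when a word ends a sentence.
    buffer = []
    for word in words:
        buffer.append(word)
        end_of_sentence = word.endswith(('.', '!', '?'))
        if len(buffer) >= min_words or end_of_sentence:
            yield ' '.join(buffer)
            buffer = []
    if buffer:
        yield ' '.join(buffer)

# Usage (hypothetical): speech starts after the first few words instead of after a full sentence.
# for chunk in chunk_words(generate_words()):
#     speak(chunk)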
TLDR: Is there a way to keep the process running long-term, and "instantly" generate output whenever it receives text input?
If you are running on a Linux machine (it possibly works on macOS too), you can create a FIFO special file and pass it to piper using the raw output mode. Then open this file from Python and write data to it.
# run once on a terminal
mkfifo /tmp/piper-fifo
./piper --model en_US-danny-low.onnx --output-raw < /tmp/piper-fifo | aplay -q -D "default:USB" -r 16000 -f S16_LE -t raw -
# open another terminal and start python
import os
fd = os.open('/tmp/piper-fifo', os.O_WRONLY)
os.write(fd, b'hello from python\n')
Note that piper will be blocked until the FIFO is opened for writing by another process (i.e. python in this case).
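A related caveat, based on standard FIFO semantics rather than anything Piper-specific: once every writer has closed the FIFO, the reader sees end-of-file, so piper would stop again. Keeping a single write end open for the lifetime of your program should avoid that; a sketch:
import os

# Open the write end once and keep it open; closing it would send EOF to piper.
fd = os.open('/tmp/piper-fifo', os.O_WRONLY)

def say(text):
    # Piper synthesizes one utterance per line, so terminate each write with a newline.
    os.write(fd, text.encode('utf-8') + b'\n')

say('hello from python')
say('and another sentence, without reloading the model')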
Yes, I looked into that as well. Thanks for the suggestion.
In the end I created a modified version of Piper that has the option to run in a loop. It's working great:
https://github.com/rhasspy/piper/pull/378
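For anyone integrating it the same way, here is a rough, simplified sketch of how such a long-lived process could be driven from Python. Treat the command lines as placeholders (the exact flags depend on the options in that pull request and on your audio setup); the point is that piper and its loaded model stay alive between sentences:
import subprocess

# Placeholder command lines: adjust the piper flags and the audio device.
aplay = subprocess.Popen(
    ['aplay', '-q', '-r', '16000', '-f', 'S16_LE', '-t', 'raw', '-'],
    stdin=subprocess.PIPE,
)
piper = subprocess.Popen(
    ['./piper', '--model', 'en_US-danny-low.onnx', '--output-raw'],
    stdin=subprocess.PIPE,
    stdout=aplay.stdin,
)

def speak(sentence):
    # One line of text per utterance; flush so piper sees it immediately.
    piper.stdin.write(sentence.encode('utf-8') + b'\n')
    piper.stdin.flush()

speak('The model only loads once.')
speak('Follow-up sentences therefore start much sooner.')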
I'm currently using piper with my "select and speak" workflow on Linux. I moved to a lower-quality voice because it was taking a long time to start talking with the larger models. I suspect, although I didn't measure it, that the delay is largely due to loading the model, since there is no lag in the speech once it starts.
triggered using a hotkey:
#! /bin/bash
# Toggle: if piper is already speaking, stop it; otherwise read the primary selection aloud.
tts_pid=$(pidof piper)
voice="en_US-libritts_r-medium.onnx"
if [ -z "$tts_pid" ]
then
xclip -out -selection primary | \
tr "\n" " " | \
~/.dotfiles/packages/piper/piper \
--model "$HOME/.dotfiles/packages/piper-models/$voice" \
--output-raw \
--length_scale 0.4 \
--sentence_silence 0.1 | \
aplay -r 22050 -f S16_LE -t raw -
else
kill $tts_pid
fi