nerd-dictation
New possible output method and tts doubts
So I want to respeak my live recorded speech.
That means: mic -> text -> sound. Or, in other words: Speech to Text and then Text to Speech.
I handle the microphone-to-text part with nerd-dictation. For the text-to-sound part I want to use festival.
1 - I have sort of added a new output method to nerd-dictation. I call it file because it's meant to go into a file.
My current work can be found at https://github.com/ruckard/nerd-dictation/tree/speech_to_file_v2 . As you can see I have not added a new option for this mode because I'm not sure if it's worth it.
The current way that I run nerd-dictation is like this:
./nerd-dictation begin --vosk-model-dir=/home/playg/vosk-models/vosk-model-small-es-0.22 --full-sentence --punctuate-from-previous-timeout 1 --idle-time 0.5 --continuous --timeout 0.5 --output=STDOUT > /tmp/output_test_file.txt
Then I just tail -f /tmp/output_test_file.txt.
2 - The current changes ( https://github.com/ruckard/nerd-dictation/commit/5acbd5468a294657b14ee5d832cc266afeb03c63 ) abuse the timeout option so that, instead of exiting the program, it processes the audio again and gives me another sentence. It also makes sure not to output new text if nothing new has been said.
The idea is to read every line (once \n is issued) and speak it with festival.
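The line-by-line reading could be sketched roughly like this. It is only an illustration, not code from my branch: the function name complete_lines and its parameters are made up for the example, and it assumes the output file is opened in text mode and only ever appended to.

```python
import time
from typing import Callable, Iterator, TextIO


def complete_lines(
    stream: TextIO,
    poll_seconds: float = 0.2,
    keep_waiting: Callable[[], bool] = lambda: True,
) -> Iterator[str]:
    """Yield each newline-terminated line as it is appended to `stream`.

    Hypothetical helper: it behaves a bit like `tail -f`, but only hands
    over a line once its trailing newline has been written.
    """
    pending = ""
    while True:
        chunk = stream.readline()
        if chunk:
            pending += chunk
            if pending.endswith("\n"):
                text = pending.strip()
                pending = ""
                if text:  # skip blank lines
                    yield text
            continue
        # Nothing new yet: either stop or wait for more text to arrive.
        if not keep_waiting():
            return
        time.sleep(poll_seconds)
```

With the defaults it keeps polling forever, so something like `for line in complete_lines(open("/tmp/output_test_file.txt")): ...` would receive one finished sentence at a time, and an unfinished line is held back until its \n arrives.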
3 - Anyway, in the end I have three questions for you:
- Do you want me to send you a pull request with a new file output mode that works as described, so that it gets added to the upstream project?
- Would you accept a pull request about a new functionality that converts the text back to sound thanks to festival (or espeak-ng or similar tool)?
- Do you know any other project that already does what I'm trying to do?
Thank you very much for your feedback.
Hey @ruckard this sounds interesting...
There are a few points I'll make in reply.
- A problem with STDOUT at the moment is that it prints everything at once, instead of printing and flushing output as you talk. This makes it OK for short commands but unusable for continuous speech that gets processed.
- A solution that improves STDOUT to flush text output when sentences are completed would be greatly appreciated.
- If the STDOUT output is working well I'm not sure there is a need for a separate file output option although if there is a compelling argument I don't see a problem with it either.
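To illustrate what flushing per sentence means here, a minimal sketch follows. It is not nerd-dictation's actual code: emit_sentences is a hypothetical helper, and it assumes the recognizer can hand over completed sentences one at a time.

```python
import sys
from typing import Iterable, Optional, TextIO


def emit_sentences(sentences: Iterable[str], out: Optional[TextIO] = None) -> int:
    """Write each completed sentence on its own line and flush immediately.

    Flushing after every sentence lets a reader on the other end of a pipe
    or a redirected file see the text right away, instead of only once the
    output buffer fills up or the program exits.
    """
    if out is None:
        out = sys.stdout
    count = 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # nothing was said, so print nothing
        out.write(sentence + "\n")
        out.flush()
        count += 1
    return count
```

The explicit flush matters mostly when STDOUT is not a terminal, because Python then buffers output in blocks rather than line by line.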
- Do you want me to send you a new file output mode which works as described pull request so that it gets added in the upstream project?
That'd be great, although I might request some changes based on my previous comments.
- Would you accept a pull request about a new functionality that converts the text back to sound thanks to festival (or espeak-ng or similar tool)?
I don't think so, mainly because it seems like a specific use case that could be supported without integration (use pipes or even a custom configuration that calls external tools).
- Do you know any other project that already does what I'm trying to do?
No, although I've seen things like this mentioned before. Out of interest, why do you need this? :)
- Do you want me to send you a new file output mode which works as described pull request so that it gets added in the upstream project?
That'd be great, although I might request some changes based on my previous comments.
My current work (which already includes the reread functionality to convert the text back to sound) can be found at: https://github.com/ruckard/nerd-dictation/tree/reread_v4 .
If you set aside my text-to-speech part, you would find:
- Adding a new mode (reread)
- Partial audio is discarded
- If reread mode is used then do not stop the engine when a sentence has been found (timeout is reached) but just print/speak it.
I might revisit my code later to rewrite it the way you want (another method that just prints to stdout). Obviously I would then have to write a companion Python script that reads from stdin and speaks what it is given.
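Such a companion script might look roughly like the sketch below. Nothing in it comes from my branch: the names speak_with_festival and reread are invented for the example, and it assumes festival is installed and on the PATH (festival --tts speaks the text it receives on standard input).

```python
import subprocess
from typing import Callable, Iterable


def speak_with_festival(text: str) -> None:
    """Speak one sentence by piping it into `festival --tts`.

    Assumes the `festival` binary is on the PATH; if it is missing,
    subprocess.run raises FileNotFoundError.
    """
    subprocess.run(["festival", "--tts"], input=text, text=True, check=False)


def reread(
    lines: Iterable[str],
    speak: Callable[[str], None] = speak_with_festival,
) -> int:
    """Speak every non-empty line and return how many were spoken."""
    spoken = 0
    for line in lines:
        sentence = line.strip()
        if sentence:
            speak(sentence)
            spoken += 1
    return spoken
```

Calling `reread(sys.stdin)` at the bottom of the script (after `import sys`) would let it sit at the end of a pipe. The speak parameter is there so that another tool such as espeak-ng could be swapped in without touching the loop.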
I am currently fine with my reread_v4 branch but I know that it's better to have this new mode integrated into upstream so that if nerd-dictation is updated I can enjoy those updates without my branch being broken.
Thank you for your feedback!
- Would you accept a pull request about a new functionality that converts the text back to sound thanks to festival (or espeak-ng or similar tool)?
I don't think so, mainly because it seems like a specific use case that could be supported without integration (use pipes or even a custom configuration that calls external tools).
I see.
- Do you know any other project that already does what I'm trying to do?
No, although I've seen things like this mentioned before, out of interest why do you need this? :)
I have found the comprise project and its associated Comprise Voice Transformer, which happens to do what I want. Unfortunately it requires Nvidia GPUs to work and seems a bit bloated.
I just want to remain as anonymous as possible on the Internet.