
Does there exist a way to stream STT output?

Open jrpilat opened this issue 6 months ago • 3 comments

Heya, thanks for your awesome work building this tool out. I'm just now getting started with it.

#159 describes a request to translate in a streaming fashion, before a pause is issued.

Is there a way to do this, but without the added step of translation? In other words, is it possible for SN to transcribe while I'm in the middle of speaking?

The use case: I might speak an entire paragraph, and I'd like the STT to occur while I do, for a few reasons:

  1. While speaking, if transcribing is not occurring, processing power is wasted. <-- addressed by Intermediate Results
  2. While speaking, if transcribing is not occurring, time is wasted waiting for it to complete. <-- addressed by Intermediate Results
  3. While speaking, I would like feedback of what I said, via transcription, before I'm finished speaking.
  4. While speaking, if transcribing is not occurring, pauses must be inserted, wasting time with silence.

I have tried STT with "Intermediate Results"-enabled models such as Coqui and April-ASR, and the intermediate results show up in the bottom statusbar of Speech Note, but they aren't showing up in the actual destination (SN or other app) until I finish speaking.

I apologize if I'm missing something obvious. Thank you again for your time and help!

jrpilat avatar Jun 26 '25 15:06 jrpilat

Hi. Thank you for the question.

As you have already noticed, Coqui-STT and April-ASR (and Vosk as well) are able to provide "intermediate results". It's almost real-time transcription, but not quite. These "intermediate results" are not final and can change once the STT engine has more audio data to process; sometimes whole sentences change. Speech Note does not add the "intermediate results" to the notepad area precisely because they are not the final version of the transcription. That is the only reason. At the moment, there is no settings option to change this behavior.
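To make the partial-vs-final distinction concrete, here is a minimal sketch of how a client of a streaming engine like Vosk might handle the two kinds of results. All names and the sample transcript are invented for illustration; this is not Speech Note's actual code.

```python
# Hypothetical illustration of "intermediate results" vs. final results.
# Partials can rewrite earlier words; only the final result is stable,
# which is why only finals would be committed to the notepad.

def transcribe_stream():
    """Yield (is_final, text) pairs as a streaming STT engine might."""
    yield (False, "the")
    yield (False, "the cat")
    yield (False, "the cat sad")            # partial guess...
    yield (True,  "the cat sat on the mat") # ...revised in the final result

committed = []   # text placed in the notepad (finals only)
statusbar = ""   # live preview, like Speech Note's bottom status bar

for is_final, text in transcribe_stream():
    if is_final:
        committed.append(text)  # commit only the stable, final result
        statusbar = ""          # clear the preview
    else:
        statusbar = text        # partials update the preview only
```

Note how the partial "the cat sad" would have been wrong if committed immediately; the engine corrected it in the final result.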

From a practical point of view, only Whisper models are accurate enough for reliable use. Unfortunately, Whisper does not support streaming. It can only transcribe part of the audio data on its own and output the result. Unlike Vosk, it cannot improve and continue the transcription based on the next chunk of audio data.

> While speaking, if transcribing is not occurring, pauses must be inserted, wasting time with silence.

Yes, that could be impractical. I could add an extra button to force transcription without waiting for silence. Would this be useful?

mkiol avatar Jun 26 '25 18:06 mkiol

Is it possible to change the duration of the pause, and commit/accept whatever the model had at that point? I'm playing around with TTS and STT as a UX for an LLM, where I might not have hands on the keyboard, and it's alright for there to be some errors: I might be able to prompt the LLM to work around them.

Failing that, a global shortcut to commit/accept would be useful, and I could try to assign it to a button on a nontraditional input device (like a small wireless mouse).

Thanks again!

jrpilat avatar Jun 27 '25 04:06 jrpilat

> Is it possible to change the duration of the pause,

Right now the "silent timeout" is fixed, but it would be useful to make it configurable. I can implement a new option for the next release.
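For illustration, the core of a configurable silent timeout could look roughly like this. This is only a sketch under assumed semantics (commit after `timeout` seconds without detected voice); the class and method names are invented, not Speech Note's implementation.

```python
class SilenceCommitter:
    """Signal a commit after `timeout` seconds without detected voice.

    Hypothetical sketch of a configurable "silent timeout"; timestamps
    are passed in explicitly to keep the logic easy to test.
    """

    def __init__(self, timeout=2.0, start=0.0):
        self.timeout = timeout        # configurable silence duration
        self.last_voice = start       # time voice was last detected

    def on_audio(self, has_voice, now):
        """Return True when the current transcription should be committed."""
        if has_voice:
            self.last_voice = now     # reset the silence clock
            return False
        return (now - self.last_voice) >= self.timeout
```

With `timeout=1.5`, voice at t=0.0 followed by silence would trigger a commit at any t ≥ 1.5, while a shorter pause would not. A forced-commit shortcut, as suggested above, would simply bypass this check and commit immediately.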

> and commit-to/accept whatever the model had at that point? Failing that, a global shortcut to commit/accept would be useful

No, but I like the idea. This can also be added in the next version. Good suggestions!

mkiol avatar Jul 01 '25 17:07 mkiol