
Dictation / Speech to Text - Superwhisper style

Open menelic opened this issue 1 year ago • 3 comments

Writing Tools is text-focused for good reasons, but current AI makes it practical to combine typed and spoken input, thanks to the ever-increasing accuracy of dictation and transcription. The latest Gemini and OpenAI models have extremely low transcription latency, making speaking faster than typing for many use cases. Local models such as faster-whisper offer nearly the same speed while transcribing completely privately on local hardware. Even in text-heavy workflows, the ability to dictate prompts about what Writing Tools should do with a given text would greatly enhance usability. Instantaneous translation is another use case where transcription could come in handy.

Please consider implementing audio transcription to facilitate input. Ideally, there would be a second hotkey, with the crucial difference that there is an option to keep the recording active only as long as the keys (e.g., S+space, r*t) are held down, to provide a sense of security against unintended recordings.
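
To make the press-and-hold idea concrete, here is a minimal sketch of the hotkey state handling. It is only an illustration: `DictationController`, `Mode`, `on_press`, and `on_release` are hypothetical names, not existing Writing Tools code, and the sketch assumes some hotkey library delivers press and release events to these two methods.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    HOLD = "hold"      # record only while the hotkey is held down
    TOGGLE = "toggle"  # first press starts recording, second press stops it


@dataclass
class DictationController:
    """Hypothetical controller; tracks whether recording should be active."""
    mode: Mode = Mode.HOLD
    recording: bool = False
    _key_down: bool = field(default=False, repr=False)

    def on_press(self) -> None:
        # Many systems resend "press" while a key is held (auto-repeat),
        # so only the first press event is acted on.
        if self._key_down:
            return
        self._key_down = True
        if self.mode is Mode.HOLD:
            self.recording = True
        else:
            self.recording = not self.recording

    def on_release(self) -> None:
        self._key_down = False
        # In hold mode, letting go of the keys always ends the recording.
        if self.mode is Mode.HOLD:
            self.recording = False
```

In hold mode the recording cannot outlive the key press, which is the safeguard against unintended recordings described above.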

menelic avatar Feb 24 '25 12:02 menelic

Hi, since you are not the first one to ask for this feature, I think it's a good idea to add this. I'll start working on a prototype and keep you updated!

momokrono avatar Mar 04 '25 14:03 momokrono

Hello :) Sorry I didn't see this earlier. I've been having continuous exams this week and couldn't go through my whole inbox.

Thank you, @momokrono! I agree that a feature like this would be quite exciting and super useful.

I did a little research and think we can flesh this out into a pretty awesome and very useful feature (one that no other program currently offers quite like this, with Gemini transcription etc.!):

I think the best implementation would be to create a new "Dictation" feature for Writing Tools, which would allow one to:

1️⃣ Speak while either pressing & holding a Dictation hotkey, or choosing to start the recording on one press of it & end it on another press (we can implement fancy options/choices a little later).

2️⃣ We'll have the text transcribed with multiple provider options:

  • For now, to start off, Gemini 2.0 Flash is a no-brainer as it's free and really quite good & fast (we just need to send the audio to the API with the prompt "Please **carefully** listen to this and transcribe this. Do not output anything but the transcription.").
  • We will surely also implement a local option through the Whisper model. Whisper Large V3 Turbo is pretty much the best option for this right now & needs only ~1.5 GB of VRAM with great results.
  • In the future, we could also explore adding an API provider option for Deepgram, because it gives people $200 of free credit (750 hours of free transcription according to them) haha.

3️⃣ We simply paste the transcribed result into the textbox (which we already have functions for).
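
A rough sketch of how steps 2️⃣ and 3️⃣ could fit together, under some assumptions: `TranscriptionProvider`, `build_gemini_request`, and `dictate` are hypothetical names, the `paste` callback stands in for the paste function we already have, and the request body follows the `contents` / `parts` / `inline_data` shape of Gemini's `generateContent` REST endpoint as I understand it. The actual HTTP call and audio capture are left out.

```python
import base64
from typing import Callable, Protocol

# Prompt quoted from the plan above.
TRANSCRIBE_PROMPT = (
    "Please **carefully** listen to this and transcribe this. "
    "Do not output anything but the transcription."
)


class TranscriptionProvider(Protocol):
    """Hypothetical interface that Gemini, Whisper, or Deepgram backends could share."""

    def transcribe(self, audio: bytes, mime_type: str) -> str: ...


def build_gemini_request(audio: bytes, mime_type: str = "audio/wav") -> dict:
    """Build a JSON body with the prompt and the audio inlined as base64.

    Assumption: the body is POSTed to the model's generateContent endpoint.
    """
    return {
        "contents": [{
            "parts": [
                {"text": TRANSCRIBE_PROMPT},
                {"inline_data": {
                    "mime_type": mime_type,
                    "data": base64.b64encode(audio).decode("ascii"),
                }},
            ]
        }]
    }


def dictate(
    audio: bytes,
    provider: TranscriptionProvider,
    paste: Callable[[str], None],
    mime_type: str = "audio/wav",
) -> str:
    """Transcribe the recording and paste the result if it is not empty."""
    text = provider.transcribe(audio, mime_type).strip()
    if text:
        paste(text)
    return text
```

Keeping the provider behind one small interface means the Settings page only has to pick which implementation to construct; the recording and pasting code would stay the same for every provider.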

Our Settings page can have a separate top tab for Dictation, with the provider options just as we have for the current (and only) Writing tab. Dictation would be a whole new aspect, yet one that can immensely speed up "writing" so it'd be a great feature for us to implement!

@momokrono, let me know what you think about this plan, and I can help build most of it next week once my exams are over :D

theJayTea avatar Mar 04 '25 18:03 theJayTea

Hello all, have a look at Whispering, which probably has the best implementation of this feature currently.

DavidGP avatar Jul 09 '25 16:07 DavidGP