                        feat: Speech-to-Text Voice Input (via Whisper)
Description
This PR adds voice input functionality via Whisper Speech-to-Text models.
A microphone button is added to the InputToolbar component next to the image/file input button (shown in the screenshots below). Clicking the button starts voice input, and the user's microphone audio begins streaming to the backend for processing.
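For reference, here is a minimal sketch of what the button wiring could look like, assuming a Web Audio capture path; the component and prop names below are illustrative, not the actual ones from this PR:

```tsx
import { useRef, useState } from "react";

// Hypothetical sketch; MicButton and onAudioChunk are placeholder names.
function MicButton(props: { onAudioChunk: (chunk: Float32Array) => void }) {
  const [recording, setRecording] = useState(false);
  const ctxRef = useRef<AudioContext | null>(null);

  async function toggle() {
    if (recording) {
      // Real code should also stop the MediaStream's tracks here.
      await ctxRef.current?.close();
      ctxRef.current = null;
    } else {
      // Triggers the system microphone-permission prompt on first use.
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const ctx = new AudioContext();
      const source = ctx.createMediaStreamSource(stream);
      // ScriptProcessorNode is deprecated but keeps the sketch short;
      // an AudioWorklet would be the production choice. Resampling to the
      // 16 kHz mono that Whisper expects is assumed to happen downstream.
      const proc = ctx.createScriptProcessor(4096, 1, 1);
      proc.onaudioprocess = (e) =>
        props.onAudioChunk(new Float32Array(e.inputBuffer.getChannelData(0)));
      source.connect(proc);
      proc.connect(ctx.destination);
      ctxRef.current = ctx;
    }
    setRecording(!recording);
  }

  return <button onClick={toggle}>{recording ? "■" : "🎤"}</button>;
}
```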
This PR bundles the quantized whisper-tiny.en model and was tested on a base-model mid-2015 MacBook Pro using CPU-only inference.
More Details
After a fair amount of testing, the implementation I settled on processes fixed windows of audio (currently ~1.5 seconds) but keeps extending the window while the speaker is still talking at the end of the current one (detected with silero-vad). This produced the best results: it accumulates speech, giving Whisper much longer context into what the user is saying. The parameters here could likely be fine-tuned with more real-world usage.
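A rough sketch of that accumulation loop (the names and the VAD/transcription hooks are placeholders, not the actual code in this PR):

```ts
// Called once per fixed window (~1.5 s) of microphone audio.
class SpeechAccumulator {
  private buffer: Float32Array[] = [];

  constructor(
    private isSpeech: (chunk: Float32Array) => boolean, // e.g. silero-vad
    private transcribe: (audio: Float32Array[]) => Promise<string>, // whisper
    private onInterim: (text: string) => void,
    private onCommit: (text: string) => void,
  ) {}

  async onWindow(chunk: Float32Array) {
    this.buffer.push(chunk);
    if (this.isSpeech(chunk)) {
      // Speaker is still talking at the window boundary: keep growing the
      // buffer so Whisper sees the full utterance, and show interim text.
      this.onInterim(await this.transcribe(this.buffer));
    } else {
      // End of utterance: transcribe the accumulated audio one final time,
      // commit the result, and clear the buffer for the next segment.
      this.onCommit(await this.transcribe(this.buffer));
      this.buffer = [];
    }
  }
}
```

Re-running transcription over the whole growing buffer each window is what gives Whisper the longer context mentioned above, at the cost of re-transcribing earlier audio.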
The frontend shows initial results while the user is still speaking and updates them as Whisper processes the rest of the speech. Once the user finishes speaking, that text is 'committed' to the frontend and the audio buffer is cleared for the next speech segment.
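Roughly, the interim/commit split could look like this on the frontend (state names are illustrative):

```ts
let committed = ""; // text from finished speech segments; never changes again
let interim = "";   // live transcription of the current segment

function onInterim(text: string) {
  interim = text; // replaced on every Whisper update
  render(committed + interim);
}

function onCommit(text: string) {
  committed += text + " "; // frozen once the segment ends
  interim = "";
  render(committed);
}

function render(text: string) {
  // In the real UI this would update the chat input's state.
  console.log(text);
}
```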
Still Todo
- allow users to specify different whisper models
- add gpu support
Checklist
- [x] The base branch of this PR is dev, rather than main
- [ ] The relevant docs, if any, have been updated or created
Screenshots
Voice input active shown by new yellow-orange gradient
New voice input button in InputToolbar
Testing
Pull and debug this branch; there is no specific configuration option to enable this feature. For now, only the tiny.en model is used, and it is packaged with continue. Once the dev environment is running, simply start a new chat and click the small microphone button in the toolbar to begin voice input. Your system may prompt you to allow microphone access; once granted, you should be able to speak and see your speech transcribed live into the chat box.
In the future, support may be added for the various whisper models or other STT providers.