                        feat: Speech-to-Text Voice Input (via Whisper)
Description
This PR adds voice input functionality via Whisper Speech-to-Text models.
A microphone button is added to the InputToolbar component next to the image/file input button (shown in the screenshots below). Clicking the button starts voice input, and the user's microphone audio begins streaming to the backend for processing.
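For reference, here is a minimal sketch of what the button wiring could look like, assuming a Web Audio capture path; the component and prop names below are illustrative, not the actual ones from this PR:

```tsx
import { useRef, useState } from "react";

// Hypothetical sketch; MicButton and onAudioChunk are placeholder names.
function MicButton(props: { onAudioChunk: (chunk: Float32Array) => void }) {
  const [recording, setRecording] = useState(false);
  const ctxRef = useRef<AudioContext | null>(null);

  async function toggle() {
    if (recording) {
      // Real code should also stop the MediaStream's tracks here.
      await ctxRef.current?.close();
      ctxRef.current = null;
    } else {
      // Triggers the system microphone-permission prompt on first use.
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const ctx = new AudioContext();
      const source = ctx.createMediaStreamSource(stream);
      // ScriptProcessorNode is deprecated but keeps the sketch short;
      // an AudioWorklet would be the production choice. Resampling to the
      // 16 kHz mono that Whisper expects is assumed to happen downstream.
      const proc = ctx.createScriptProcessor(4096, 1, 1);
      proc.onaudioprocess = (e) =>
        props.onAudioChunk(new Float32Array(e.inputBuffer.getChannelData(0)));
      source.connect(proc);
      proc.connect(ctx.destination);
      ctxRef.current = ctx;
    }
    setRecording(!recording);
  }

  return <button onClick={toggle}>{recording ? "■" : "🎤"}</button>;
}
```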
This PR bundles the quantized whisper-tiny.en model and was tested on a base-model mid-2015 MacBook Pro using CPU-only inference.
More Details
After a fair amount of testing, the implementation I settled on processes fixed windows of audio (currently ~1.5 seconds) but keeps extending the window while the speaker is still talking at the end of the current one (detected with silero-vad). This produced the best results: it accumulates speech, giving Whisper much longer context into what the user is saying. The parameters here could likely be fine-tuned with more real-world usage.
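A rough sketch of that accumulation loop (the names and the VAD/transcription hooks are placeholders, not the actual code in this PR):

```ts
// Called once per fixed window (~1.5 s) of microphone audio.
class SpeechAccumulator {
  private buffer: Float32Array[] = [];

  constructor(
    private isSpeech: (chunk: Float32Array) => boolean, // e.g. silero-vad
    private transcribe: (audio: Float32Array[]) => Promise<string>, // whisper
    private onInterim: (text: string) => void,
    private onCommit: (text: string) => void,
  ) {}

  async onWindow(chunk: Float32Array) {
    this.buffer.push(chunk);
    if (this.isSpeech(chunk)) {
      // Speaker is still talking at the window boundary: keep growing the
      // buffer so Whisper sees the full utterance, and show interim text.
      this.onInterim(await this.transcribe(this.buffer));
    } else {
      // End of utterance: transcribe the accumulated audio one final time,
      // commit the result, and clear the buffer for the next segment.
      this.onCommit(await this.transcribe(this.buffer));
      this.buffer = [];
    }
  }
}
```

Re-running transcription over the whole growing buffer each window is what gives Whisper the longer context mentioned above, at the cost of re-transcribing earlier audio.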
The frontend shows initial results while the user is still speaking and updates them as Whisper processes the rest of the speech. Once the user finishes speaking, that text is 'committed' to the frontend and the audio buffer is cleared for the next speech segment.
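Roughly, the interim/commit split could look like this on the frontend (state names are illustrative):

```ts
let committed = ""; // text from finished speech segments; never changes again
let interim = "";   // live transcription of the current segment

function onInterim(text: string) {
  interim = text; // replaced on every Whisper update
  render(committed + interim);
}

function onCommit(text: string) {
  committed += text + " "; // frozen once the segment ends
  interim = "";
  render(committed);
}

function render(text: string) {
  // In the real UI this would update the chat input's state.
  console.log(text);
}
```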
Still Todo
- allow users to specify different whisper models
- add gpu support
Checklist
- [x] The base branch of this PR is dev, rather than main
- [ ] The relevant docs, if any, have been updated or created
Screenshots
Voice input active shown by new yellow-orange gradient
New voice input button in InputToolbar
Testing
Pull and debug this branch; there is no specific configuration option to enable this feature. For now, only the tiny.en model is used, and it is packaged with continue. Once the dev environment is running, simply start a new chat and click the small microphone button in the toolbar to begin voice input. Your system may prompt you to allow microphone access; once granted, you should be able to speak and see your speech transcribed live into the chat box.
In the future, support may be added for the various whisper models or other STT providers.