LibreChat
🗣️ feat: STT & TTS
Summary
For STT, press the button or use `Shift` + `Alt` + `L`
For TTS, press the button (hold the click to download the audio file)
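The `Shift` + `Alt` + `L` binding can be wired up with a plain `keydown` listener. A minimal sketch follows; `matchesSttShortcut` and `toggleSpeechToText` are illustrative names, not the actual LibreChat implementation:

```javascript
// Returns true when a KeyboardEvent-like object matches Shift + Alt + L.
// Only the fields checked below are required, which keeps it easy to test.
function matchesSttShortcut(event) {
  return Boolean(event.shiftKey && event.altKey && event.code === 'KeyL');
}

// Browser wiring (illustrative; `toggleSpeechToText` is a hypothetical handler):
// document.addEventListener('keydown', (e) => {
//   if (matchesSttShortcut(e)) toggleSpeechToText();
// });
```

Checking `event.code` rather than `event.key` keeps the shortcut stable across keyboard layouts.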
Checklist
STT
- [x] Browser
- [x] OpenAI Whisper
- [x] Local Whisper (tested on LocalAI and HomeAssistant Whisper)
- [ ] Azure Whisper (not tested yet but it should work)
- [x] All the OpenAI compatible STT
TTS
- [x] Browser
- [x] Elevenlabs
- [x] OpenAI TTS
- [x] Piper
- [x] Coqui
- [x] All the OpenAI compatible TTS
TODO:
- [x] ~~fix hark 🎤~~
- [x] improve STTBrowser error handling
- [ ] handle audio files in the file upload and automatically transcribe them
UI
Speech Tab Explanation
NOTE: This is an explanation of how the automatic conversation mode works. To use it, you need to enable all of the settings in the Speech tab. This feature is still in beta and may not always work as expected. Right now, the TTS call is still not triggered after the AI responds.
```mermaid
graph TD;
UserRequest((User Requests STT)) --> CheckLocalStorage{Check Local Storage for Engine};
CheckLocalStorage -->|Engine Browser| AutomaticBrowser((Automatic Browser STT));
CheckLocalStorage -->|Engine External| ExternalCheck{Check Transcription Status};
ExternalCheck -->|Transcription Active| StopTranscription;
ExternalCheck -->|Transcription Inactive| ListenAudio((Listen to User Audio));
ListenAudio --> CheckAudio{Check Audio Level};
CheckAudio -->|Below Threshold| SaveAudio;
CheckAudio -->|Above Threshold| ContinueRecording;
SaveAudio --> DataProviderRequest((Data Provider Request));
DataProviderRequest --> APICall("/api/files/stt");
APICall -->|Success| SetText((Set Text in Text Area));
SetText -->|Auto Send Text Enabled| AutoSendRequest((Auto Send Text Request));
AutoSendRequest --> APICall2("/chat/completions");
APICall2 -->|Success| TriggerTTS((Trigger TTS));
TriggerTTS --> TTSRequest((TTS Request));
TTSRequest --> APICall3("/api/files/tts");
APICall3 -->|Success| PlayAudio((Play Audio));
PlayAudio -->|Playback Finished| WaitTwoSeconds;
WaitTwoSeconds --> RepeatSTT((Repeat STT Trigger));
subgraph Loop
RepeatSTT --> ListenAudio;
end
StopTranscription((Stop Transcription));
```
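The "Check Audio Level" branch in the diagram can be sketched as a small pure function: once the measured level drops below a silence threshold, the clip is saved and sent; otherwise recording continues. The names, the default threshold, and the normalized 0–1 level scale are illustrative assumptions, not LibreChat's actual code:

```javascript
// Decide the next recorder step from the current audio level.
// Below the threshold means the user has gone quiet, so the clip is
// saved and sent for transcription; above it, recording continues.
function nextRecorderAction(audioLevel, silenceThreshold = 0.05) {
  return audioLevel < silenceThreshold ? 'saveAudio' : 'continueRecording';
}
```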
Thank you @bsu3338 for the integrated browser STT & TTS, and thank you @szkiu for the Azure STT (#2025).
Change Type
- [x] New feature (non-breaking change which adds functionality)
- [x] This change requires a documentation update
Testing
Checklist
- [x] My code adheres to this project's style guidelines
- [x] I have performed a self-review of my own code
- [x] I have commented in any complex areas of my code
- [x] I have made pertinent documentation changes
- [x] My changes do not introduce new warnings
- [x] I have written tests demonstrating that my changes are effective or that my feature works
- [x] Local unit tests pass with my changes
- [x] Any changes dependent on mine have been merged and published in downstream modules.
@Berry-13 Thank you for finishing it. I was just about to take another look at it, but am glad you are. Congrats to the whole team on the GitHub trending!
> @Berry-13 Thank you for finishing it. I was just about to take another look at it, but am glad you are. Congrats to the whole team on the GitHub trending!
you're welcome 😊
Is this still in draft?
> Is this still in draft?
yes
Good that you added Azure STT from @szkiu!
Now what are we waiting for to merge this?
Any way we can help?
> Good that you added Azure STT from @szkiu!
> Now what are we waiting for to merge this?
> Any way we can help?
There are still some things to add: fix merge conflicts, add docs, and also fix the TTS, since right now it's not sending the buffer to the client correctly.
Would be nice to add deepgram.io; it has Whisper models as well, but it's much faster, and there are 200 free minutes per month
> Would be nice to add deepgram.io; it has Whisper models as well, but it's much faster, and there are 200 free minutes per month
sure, I'll take a look at this, but in another PR. I already want to add Google Cloud STT & TTS, and I'll try to add this too
@Berry-13 Any update on this PR?
> @Berry-13 Any update on this PR?
Hey there! The PR is all set to go! The only thing left is to test the Azure Whisper feature, but I'm still waiting for the key. Unfortunately, until I have it, there isn't much more I can do at the moment. @danny-avila mentioned he'll be reviewing it within the next few weeks
Great job @Berry-13
My initial comments are just from glancing at the code through github.
I will do a more thorough review once you make the changes and I pull down the code for testing.
Eagerly waiting for this PR
Same here. Would really help orgs that cater to employees with disabilities or ADA compliance requirements
P.S. Adding Deepgram support earns automatic canonization
Is it possible to run Whisper locally, or to use an inference framework such as Triton Inference Server?
> Is it possible to run Whisper locally, or to use an inference framework such as Triton Inference Server?
oops, not sure why I missed this ping. You can run Whisper locally with LocalAI and then pass the URL into the `librechat.yaml` file
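The exact schema may differ, so treat the key names below as assumptions and check the LibreChat docs; the idea is just that the STT section of `librechat.yaml` points at the OpenAI-compatible transcription endpoint LocalAI exposes:

```yaml
# Illustrative sketch only - verify the key names against the LibreChat docs.
speech:
  stt:
    openai:
      # LocalAI serves an OpenAI-compatible /v1/audio/transcriptions endpoint.
      url: 'http://localai:8080/v1/audio/transcriptions'
      model: 'whisper-base'
```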
Eagerly waiting for this PR
I built this branch and have the button there as shown in the pic, and I can speak and it outputs to the console correctly. However, I can't figure out how to get the ElevenLabs API working so that my AI can talk back. I see the button to press under the message, but it has no effect. Can you give me some directions on how to finish getting this going? Thank you!
Benefit from merging the feature now to main > Benefit from waiting 1 more month to add new features
> Benefit from merging the feature now to main > Benefit from waiting 1 more month to add new features
that's no fun man... lol. I guess I hear you though. Props on this feature, however! Y'all are doing a good job with this project.
PS: Are there any forms of TTS that are working that you could give me a hint on? Even if they are beta solutions
> Benefit from merging the feature now to main > Benefit from waiting 1 more month to add new features
The benefit of the current plan is less maintenance and work on a new feature, which would otherwise delay planned updates.
I also advise against merging this into a fork, because there are changes yet to be made in this PR.
Thank you @berry-13 for continuing to work on this
@kneelesh48 @mf @xixingya @Fakamoto
hello you four. I don't want (didn't want) to wait any longer either, so I downloaded the branch locally, built it myself, and imported the image into my Docker system. Done.
If any of you use Docker, I have uploaded the image to a filehoster (unfortunately it only works for 21 days and max. 50 downloads). So if any of you have no idea how Docker images work but would like to use this wonderful TTS/STT function and have it running on Docker: here you can download the latest TTS/STT LibreChat image that I have created. (Note that this image will NOT be updated and should only be a temporary solution until danny merges.)
I hope I can help some of you.
I can say: TTS/STT works great!
For those who want to do it themselves:
```shell
git clone -b Speech-to-Text https://github.com/danny-avila/LibreChat.git
cd LibreChat
(apt install docker.io)
docker build -t librechat_tts .
(docker save --output /path/to/librechat_tts.tar librechat_tts)
```
...and that's it. You only have to do the last step in () if you want to export the image. You will then get the same image as you get from my link above.
(berry and danny, if this is a problem please delete this comment)
have a nice day guys :)
thanks
> PS: Are there any forms of TTS that are working that you could give me a hint on? Even if they are beta solutions
thanks! yes,
Local:
- TTS: Piper
- STT: Whisper-Base (LocalAI)
External (paid):
- TTS: ElevenLabs
- STT: OpenAI Whisper
@berry-13 when do you think you will be completely finished with this pr so that it is ready to merge?
> @berry-13 when do you think you will be completely finished with this PR so that it is ready to merge?
When I commit, it means the changes are ready for merging. But since @danny-avila mentioned he's going to refactor and fix some things, I'll continue until he begins reviewing it. Besides, I'll be working with him to ensure Conversation Mode works properly, since it's only partially functional at the moment
@berry-13 have you added support for Azure and GCP TTS in this PR? Those are the OG TTS models. Also, ElevenLabs is expensive, and I don't like their subscription pricing model.
> @berry-13 have you added support for Azure and GCP TTS in this PR? Those are the OG TTS models. Also, ElevenLabs is expensive, and I don't like their subscription pricing model.
I personally use ElevenLabs. It has WebSocket support and one of the best TTS models out there. I can't add Azure TTS because I don't have a key (I can't get one). Google TTS is planned, and I'm working on adding support for multiple providers. I'll also be adding some other providers in the future
@berry-13 I can provide you an azure key