LibreChat 🗣️ feat: STT & TTS

Summary

For STT, press the button or use Shift + Alt + L

For TTS, press the button (press and hold it to download the audio file)

checklist

STT

[x] Browser
[x] OpenAI Whisper
[x] Local Whisper (tested on LocalAI and HomeAssistant Whisper)
[ ] Azure Whisper (not tested yet but it should work)
[x] All OpenAI-compatible STT services

TTS

[x] Browser
[x] Elevenlabs
[x] OpenAI TTS
[x] Piper
[x] Coqui
[x] All OpenAI-compatible TTS services

TODO:

[x] ~~fix hark 🤔~~
[x] improve STTBrowser error handling
[ ] handle audio files in the file upload and automatically transcribe them

UI

Speech TAB Explanation

NOTE: This is an explanation of how the automatic conversation works. To use it, you need to enable all of the settings in the Speech tab. This feature is still in beta and may not always work as expected. Right now, the TTS call is not yet triggered after the AI's reply.

graph TD;

    UserRequest((User Requests STT)) --> CheckLocalStorage{Check Local Storage for Engine};
    CheckLocalStorage -->|Engine Browser| AutomaticBrowser((Automatic Browser STT));
    CheckLocalStorage -->|Engine External| ExternalCheck{Check Transcription Status};
    ExternalCheck -->|Transcription Active| StopTranscription;
    ExternalCheck -->|Transcription Inactive| ListenAudio((Listen to User Audio));
    ListenAudio --> CheckAudio{Check Audio Level};
    CheckAudio -->|Below Threshold| SaveAudio;
    CheckAudio -->|Above Threshold| ContinueRecording;
    SaveAudio --> DataProviderRequest((Data Provider Request));
    DataProviderRequest --> APICall("/api/files/stt");
    APICall -->|Success| SetText((Set Text in Text Area));
    SetText -->|Auto Send Text Enabled| AutoSendRequest((Auto Send Text Request));
    AutoSendRequest --> APICall2("/chat/completions");
    APICall2 -->|Success| TriggerTTS((Trigger TTS));
    TriggerTTS --> TTSRequest((TTS Request));
    TTSRequest --> APICall3("/api/files/tts");
    APICall3 -->|Success| PlayAudio((Play Audio));
    PlayAudio -->|Playback Finished| WaitTwoSeconds;
    WaitTwoSeconds --> RepeatSTT((Repeat STT Trigger));

    subgraph Loop
    RepeatSTT --> ListenAudio;
    end

    StopTranscription((Stop Transcription));
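The branching in the chart above can be sketched as a few pure functions. This is only an illustration of the flow as drawn: every type and function name here is hypothetical and not taken from the PR's code, and the chart does not say what happens when the level equals the threshold, so the sketch treats that case as "keep recording".

```typescript
// Hypothetical names throughout; this mirrors the flowchart, not LibreChat's actual code.
type Engine = "browser" | "external";

type Action =
  | "automatic-browser-stt"
  | "stop-transcription"
  | "listen"
  | "continue-recording"
  | "save-audio";

// First branch of the chart: which path a user STT request takes.
function onSttRequest(engine: Engine, transcriptionActive: boolean): Action {
  if (engine === "browser") return "automatic-browser-stt";
  return transcriptionActive ? "stop-transcription" : "listen";
}

// Second branch: while listening, a level below the threshold is treated as
// silence and ends the recording; otherwise recording continues.
// (Assumption: a level exactly at the threshold keeps recording.)
function onAudioLevel(level: number, threshold: number): Action {
  return level < threshold ? "save-audio" : "continue-recording";
}

// Steps that follow a saved recording, in chart order. The loop back to
// listening only happens when auto send is enabled.
function stepsAfterSave(autoSendText: boolean): string[] {
  const steps = ["POST /api/files/stt", "set text in text area"];
  if (autoSendText) {
    steps.push(
      "auto send text (/chat/completions)",
      "POST /api/files/tts",
      "play audio",
      "wait two seconds",
      "repeat STT trigger"
    );
  }
  return steps;
}
```

Keeping the decisions separate from the recording and network code makes each branch of the chart easy to check on its own.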

Thank you @bsu3338 for the integrated browser STT & TTS.

Thank you @szkiu for the Azure STT #2025

Change Type

[x] New feature (non-breaking change which adds functionality)
[x] This change requires a documentation update

Testing

Checklist

[x] My code adheres to this project's style guidelines
[x] I have performed a self-review of my own code
[x] I have commented in any complex areas of my code
[x] I have made pertinent documentation changes
[x] My changes do not introduce new warnings
[x] I have written tests demonstrating that my changes are effective or that my feature works
[x] Local unit tests pass with my changes
[x] Any changes dependent on mine have been merged and published in downstream modules.

Jan 20 '24 22:01 berry-13

@Berry-13 Thank you for finishing it. I was just about to take another look at it, but am glad you are. Congrats to the whole team on the GitHub trending!

Jan 20 '24 23:01 bsu3338

@Berry-13 Thank you for finishing it. I was just about to take another look at it, but am glad you are. Congrats to the whole team on the GitHub trending!

you're welcome 😊

Jan 20 '24 23:01 berry-13

Is this still in draft?

Mar 05 '24 15:03 danny-avila

Is this still in draft?

yes

Mar 05 '24 15:03 berry-13

Good that you added azure stt from @szkiu !

Now what are we waiting to merge this?

Any way we can help?

Mar 16 '24 03:03 Fakamoto

Good that you added azure stt from @szkiu !

Now what are we waiting to merge this?

Any way we can help?

There are still some things to do: add the remaining pieces, fix merge conflicts, add docs, and fix the TTS, since right now it's not correctly sending the buffer to the client.

Mar 16 '24 09:03 berry-13

Would be nice to add deepgram.io , it has whisper models as well, but it's much faster and 200 minutes per month free

Mar 25 '24 19:03 virtuman

Would be nice to add deepgram.io , it has whisper models as well, but it's much faster and 200 minutes per month free

sure, I'll take a look at this but in another PR. I already want to add the Google cloud STT & TTS, I'll try to add this too

Mar 25 '24 21:03 berry-13

@Berry-13 Any update on this PR?

Mar 28 '24 08:03 twmht

@Berry-13 Any update on this PR?

Hey there! The PR is all set to go! The only thing left is to test the Azure Whisper feature, but I'm still waiting for the key for that. Unfortunately, until I have it, there isn't much more I can do at the moment. @danny-avila mentioned he'll be reviewing it within the next few weeks.

Mar 28 '24 12:03 berry-13

Great job @Berry-13

My initial comments are just from glancing at the code through github.

I will do a more thorough review once you make the changes and I pull down the code for testing.

Apr 01 '24 14:04 danny-avila

Eagerly waiting for this PR

Apr 06 '24 09:04 kneelesh48

Same here. Would really help orgs that cater to employees with disabilities or ADA compliance requirements

P.S. Adding Deepgram support earns automatic canonization

Apr 07 '24 00:04 mf

Is it possible to integrate a locally loaded Whisper model, or to use an inference framework such as Triton Inference Server?

Apr 10 '24 01:04 luvwinnie

Is it possible to integrate a locally loaded Whisper model, or to use an inference framework such as Triton Inference Server?

oops, not sure why I missed this ping. You can run Whisper locally with LocalAI and then pass the URL into the librechat.yaml file
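Before putting the URL into librechat.yaml, it can help to confirm that the local server answers an OpenAI-style transcription request. The sketch below assumes the server exposes the OpenAI-compatible `/v1/audio/transcriptions` route (multipart form with `file` and `model` fields, JSON response with a `text` field). The function names and the `recording.webm` filename are hypothetical, and none of this is taken from the PR's code.

```typescript
// A minimal sketch, assuming an OpenAI-compatible STT server (for example LocalAI).
// All function names here are illustrative only.

// Join the base URL and the transcription route with exactly one slash.
function transcriptionUrl(baseUrl: string): string {
  return baseUrl.replace(/\/+$/, "") + "/v1/audio/transcriptions";
}

// The OpenAI-style endpoint expects multipart form data with the audio
// in a "file" field and the model name in a "model" field.
function buildTranscriptionForm(audio: Blob, model: string): FormData {
  const form = new FormData();
  form.append("file", audio, "recording.webm"); // filename is a placeholder
  form.append("model", model);
  return form;
}

// Send the audio and return the transcribed text.
async function transcribe(baseUrl: string, audio: Blob, model: string): Promise<string> {
  const res = await fetch(transcriptionUrl(baseUrl), {
    method: "POST",
    body: buildTranscriptionForm(audio, model),
  });
  if (!res.ok) {
    throw new Error(`STT request failed with status ${res.status}`);
  }
  const data = (await res.json()) as { text: string };
  return data.text;
}
```

For example, `transcribe("http://localhost:8080", audioBlob, "whisper-base")` would target a server on port 8080 with a model named `whisper-base`; both values are assumptions and depend on how your local server is set up.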

Apr 12 '24 19:04 berry-13

Eagerly waiting for this PR

Apr 23 '24 02:04 xixingya

I built this branch and have the button there as shown in the pic, and I can speak and it outputs to the console correctly. However, I can't figure out how to get the ElevenLabs API working so that my AI can talk back. I see the button to press under the message but it has no effect. Can you give me some directions on how to finish getting this going? Thank you!

Apr 23 '24 13:04 bpawnzZ

Benefit from merging the feature now to main > Benefit from waiting 1 more month to add new features

Apr 23 '24 18:04 Fakamoto

Benefit from merging the feature now to main > Benefit from waiting 1 more month to add new features

That's no fun, man... Lol. I guess I hear you though. Props on this feature, however! Y'all are doing a good job with this project.

PS. Are there any forms of TTS that are working that you could give me a hint on? Even if they are beta solutions

Apr 23 '24 18:04 bpawnzZ

Benefit from merging the feature now to main > Benefit from waiting 1 more month to add new features

The benefit of the current plan is less maintenance and work on a new feature which would delay planned updates.

Apr 23 '24 19:04 danny-avila

I also advise against merging this into a fork because there are changes yet to be done in this PR

Apr 23 '24 19:04 danny-avila

Thank you @berry-13 for continuing to work on this

Apr 25 '24 20:04 danny-avila

@kneelesh48 @mf @xixingya @Fakamoto

Hello you four. I didn't want to wait any longer either, so I downloaded the branch locally, built it myself and imported the image into my Docker system. Done.

If any of you use Docker, I have uploaded the image to a file hoster (unfortunately it only works for 21 days and max. 50 downloads). So if any of you have no idea how Docker images work but would like to use this wonderful TTS/STT function and have it running on Docker: Here you can download the latest TTS/STT LibreChat image that I have created. (Note that this image will NOT be updated and should only be a temporary solution until danny merges.)

I hope I can help some of you.

I can say: TTS/STT works great!

For those who want to do it themselves:

git clone -b Speech-to-Text https://github.com/danny-avila/LibreChat.git
cd LibreChat
(apt install docker.io)
docker build -t librechat_tts .
(docker save --output /path/to/librechat_tts.tar librechat_tts)

...and that's it. you only have to do the last step in () if you want to export the image. you will then get the same image as you get from my link above.

(berry and danny, if this is a problem please delete this comment)

have a nice day guys :)

Apr 25 '24 21:04 XHyperDEVX

thanks

Apr 26 '24 03:04 xixingya

PS. Are there any forms of TTS that are working that you could give me a hint on? Even if they are beta solutions

thanks! yes,

Local:

TTS: Piper
STT: Whisper-Base (LocalAI)

External (paid):

TTS: ElevenLabs
STT: OpenAI Whisper

Apr 26 '24 08:04 berry-13

@berry-13 when do you think you will be completely finished with this pr so that it is ready to merge?

Apr 27 '24 21:04 XHyperDEVX

@berry-13 when do you think you will be completely finished with this pr so that it is ready to merge?

when I commit, it means the changes are ready for merging. But since @danny-avila mentioned he's going to refactor and fix some things, I'll continue until he begins reviewing it. Besides, I'll be working with him to ensure the Conversation Mode works properly since it's only partially functional at the moment

Apr 27 '24 21:04 berry-13

@berry-13 have you added support for Azure and GCP TTS in this PR? Those are the OG TTS models. Also, eleven labs is expensive and I don't like their subscription pricing model.

May 07 '24 17:05 kneelesh48

@berry-13 have you added support for Azure and GCP TTS in this PR? Those are the OG TTS models. Also, eleven labs is expensive and I don't like their subscription pricing model.

I personally use ElevenLabs. It has websocket support and one of the best TTS models out there. I can't add Azure TTS because I don't have a key and can't get one. Google TTS is planned, and I'm working on adding support for multiple providers. I'll also be adding some other providers in the future

May 08 '24 14:05 berry-13

@berry-13 I can provide you an azure key

May 11 '24 16:05 kneelesh48
