LibreChat Speech-To-Text and Text-To-Speech

This is a quick implementation of speech-to-text and text-to-speech for browsers with built-in support. Speech-to-text only works in Chrome-based browsers and Safari, but text-to-speech works in all browsers as far as I can tell.

Shift-Alt-L - Enable Listening

Browser list: https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API#browser_compatibility

After digging into this, I would not mind adding both TTS and STT with standard JS, without the React component. Just use https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API
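A minimal sketch of what feature detection for the plain Web Speech API route could look like. The `SpeechGlobals` type and the `detectSpeechSupport` name are illustrative assumptions, not code from this PR; `SpeechRecognition`, `webkitSpeechRecognition`, and `speechSynthesis` are the standard global names.

```typescript
// Hypothetical helper: reports which halves of the Web Speech API are present.
// The globals object is passed in so the check also runs outside a browser.
interface SpeechGlobals {
  SpeechRecognition?: unknown;
  webkitSpeechRecognition?: unknown;
  speechSynthesis?: unknown;
}

interface SpeechSupport {
  stt: boolean; // speech-to-text (recognition constructor available)
  tts: boolean; // text-to-speech (synthesis object available)
}

function detectSpeechSupport(globals: SpeechGlobals): SpeechSupport {
  // Some browsers only expose the prefixed recognition constructor.
  const recognition = globals.SpeechRecognition ?? globals.webkitSpeechRecognition;
  const synthesis = globals.speechSynthesis;
  return {
    stt: typeof recognition === 'function',
    tts: typeof synthesis === 'object' && synthesis !== null,
  };
}

// Browser usage (commented out so the sketch also runs outside a browser):
// const support = detectSpeechSupport(window as unknown as SpeechGlobals);
// if (support.stt) { /* render the mic */ }
```

With a check like this, the mic would only be rendered when `stt` is true, which is one way to keep it hidden in browsers without recognition support.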

Type of change

Please delete options that are not relevant.

[x] New feature (non-breaking change which adds functionality)
[x] This change requires a documentation update
[x] Documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration:

Go to LibreChat in a supported browser. If you see a mic next to the submit arrow, it should work. The most notable exception is Firefox.

Test Configuration:

Checklist:

[x] My code follows the style guidelines of this project
[x] I have performed a self-review of my code
[x] I have commented my code, particularly in hard-to-understand areas
[x] I have made corresponding changes to the documentation
[x] My changes generate no new warnings
[x] I have added tests that prove my fix is effective or that my feature works
[x] New and existing unit tests pass locally with my changes
[x] Any dependent changes have been merged and published in downstream modules

Aug 04 '23 21:08 bsu3338

The only thing we may want to document is the supported browser list. The mic will not appear in Firefox.

Aug 04 '23 21:08 bsu3338

The only thing we may want to document is the supported browser list. The mic will not appear in Firefox.

I'm sure we can support this on Firefox. Also, I would prefer that we don't add a second button and instead replace the submit button, using the proposed methods below.

It could be activated by a keyboard shortcut (pressing it would initiate the TTS logic; the first time would prompt for browser mic permission) and by long-pressing the submit button. Maybe we can start with the first and work towards the second in another PR, because I imagine the second may be more involved.

A keyboard shortcut would be great and would also be a first step toward matching what chat.openai.com now has with their helper icon.

Aug 04 '23 21:08 danny-avila

Once it is activated through a shortcut, should we display a red circle at the bottom middle of the page while it is recording?

I was also looking at text-to-speech. Would that be presented the same way, through a shortcut? If so, do we want to show a notification while the audio is playing? Will we have shortcuts for play, pause, and stop? Or, while it is playing, a green circle could be displayed, and clicking it would stop the audio.

Aug 04 '23 22:08 bsu3338

Once it is activated through a shortcut, should we display a red circle at the bottom middle of the page while it is recording?

I was also looking at text-to-speech. Would that be presented the same way, through a shortcut? If so, do we want to show a notification while the audio is playing? Will we have shortcuts for play, pause, and stop? Or, while it is playing, a green circle could be displayed, and clicking it would stop the audio.

Sorry I keep confusing TTS with STT lol

I think for TTS it should be per message. And STT can replace the submit button with the recording indicator, similar to how it changes to 3 dots on the official site when it's generating text (which I may resurrect here).

Aug 04 '23 22:08 danny-avila

I converted the configuration to use the shortcut Shift+Alt+L (L for listen), but feel free to change it. The submit button changes into a recording icon while recording. Currently I auto-submit the text when the person is done talking. This may need to be a user configuration option, because some users may want to correct the STT output. However, ChatGPT is pretty good at just knowing what you mean :). The browser usually prompts the user about the mic being enabled. I may need help if we need to put up a popup message for the user or add a help button in the bottom right-hand corner of the screen. I also made the SVG with ChatGPT, so that may need to be updated if someone wants something better.

Aug 05 '23 06:08 bsu3338

I also added a countdown timer because the first second after I started speaking never seemed to be picked up. 3 seconds feels a little too long, but it seems we need at least 1 second; 2 may be good. I would hate to display the mic before it actually starts recording.
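A rough sketch of the countdown idea, assuming the recording icon should only appear once the countdown has finished. `createCountdown` and its callbacks are hypothetical names for illustration; the PR's actual timer may be structured differently.

```typescript
// Hypothetical helper, not taken from the PR: returns a tick() function that
// the caller drives once per second (for example from setInterval).
function createCountdown(
  seconds: number,
  onTick: (remaining: number) => void,
  onDone: () => void,
): () => boolean {
  let remaining = Math.max(0, Math.floor(seconds));
  let finished = false;
  return function tick(): boolean {
    if (finished) {
      return true; // already done; never fire onDone twice
    }
    if (remaining > 0) {
      onTick(remaining); // show "2", then "1", ...
      remaining -= 1;
      return false;
    }
    finished = true;
    onDone(); // e.g. swap the submit button for the recording icon
    return true;
  };
}

// Browser usage (commented out so the sketch also runs outside a browser):
// const tick = createCountdown(2, showNumber, showRecordingIcon);
// tick();
// const id = setInterval(() => { if (tick()) clearInterval(id); }, 1000);
```

Keeping the duration as a parameter makes it easy to try 1, 2, or 3 seconds without touching the rest of the logic.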

Aug 06 '23 00:08 bsu3338

Added a text-to-speech toggle with Shift+Alt+P. What are your thoughts about adding the ability to select text-to-speech voice settings within a preset? I did not show anything in the interface indicating that text-to-speech was enabled. I did not know how you would want to handle that.

Aug 09 '23 06:08 bsu3338

Added a text-to-speech toggle with Shift+Alt+P. What are your thoughts about adding the ability to select text-to-speech voice settings within a preset? I did not show anything in the interface indicating that text-to-speech was enabled. I did not know how you would want to handle that.

It should not be within a preset. It makes more sense to toggle it, without keyboard shortcuts, in the settings dialog/modal.

Aug 09 '23 14:08 danny-avila

Going to test this later today when I'm in an environment to talk out loud :)

Aug 09 '23 14:08 danny-avila

My last-minute change was just to prevent the mic from showing up in Firefox and to put the keyboard shortcut for speechrecognition inside speechrecognition instead of TextChat.

Aug 09 '23 20:08 bsu3338

Sorry for the delay on this, but I have to say, really clean code and easy to read!

Did some testing just now and have some comments to make too

Aug 15 '23 05:08 danny-avila

I think STT works well enough that I'm happy to introduce it as is with some minor changes.

The main quirk is the one I highlighted in the comment on SubmitButton. I think the click can be handled so that it cancels listening and leaves whatever has been processed in the textarea.
More to this point, I don't like that it auto-submits; that's tokens automatically down the drain if I'm not happy with the processed input. Simply adding the processed speech to the textarea is huge, and not "pressing send" behaves like Apple's Siri.
I would like to introduce this library to the project for hotkeys, since it's small and has no dependencies. It would help if you implemented it now: https://www.npmjs.com/package/hotkeys-js
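The requested click behavior can be sketched as a small state reducer. This is an illustration under assumptions, not the PR's code: `SttState`, `SttAction`, and `sttReducer` are hypothetical names, and the real SubmitButton may track state differently.

```typescript
// Hypothetical state model: what the textarea holds and whether we are listening.
interface SttState {
  listening: boolean;
  text: string;        // current textarea contents
  submitted: string[]; // messages that were actually sent
}

type SttAction =
  | { type: 'start' }                      // shortcut pressed
  | { type: 'transcript'; value: string }  // recognized speech arrived
  | { type: 'click' };                     // submit button clicked

function sttReducer(state: SttState, action: SttAction): SttState {
  switch (action.type) {
    case 'start':
      return { ...state, listening: true };
    case 'transcript': {
      if (!state.listening) return state;
      // Append recognized speech to the textarea; never submit automatically.
      const spoken = action.value.trim();
      const sep = state.text !== '' && !state.text.endsWith(' ') ? ' ' : '';
      return { ...state, text: state.text + sep + spoken };
    }
    case 'click': {
      // While listening, a click only cancels listening and keeps the text.
      if (state.listening) return { ...state, listening: false };
      if (state.text.trim() === '') return state;
      return { ...state, text: '', submitted: [...state.submitted, state.text] };
    }
  }
}
```

In this model nothing is sent until the user clicks while not listening, so a bad transcription can be corrected in the textarea first.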

For now, I think the changes above would be good to introduce, with the rest below left for a future PR to prevent scope creep.

My main qualms are with TTS, though I understand this feature would be crucial for someone needing it.

It's not clear when it's activated, and it exhibits weird behavior with multiple messages or when toggling back and forth (which happened to me because I was unsure whether it was on or not). The browser SDK will keep playing even if I refresh the page, and may continue for a while with no way to make it stop. This can be particularly weird and/or frustrating if I had no idea this was a feature and somehow activated it.
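One possible mitigation for audio continuing across a refresh is to cancel any queued speech when the page is being left. The sketch below is an assumption about how that could be wired, not a tested fix for the behavior described above; `stopSpeechOnUnload`, `SynthLike`, and `EventTargetLike` are hypothetical names, while `speaking`, `pending`, and `cancel()` mirror the standard `SpeechSynthesis` interface.

```typescript
// Minimal structural types so the sketch runs outside a browser.
interface SynthLike {
  speaking: boolean;
  pending: boolean;
  cancel(): void;
}

interface EventTargetLike {
  addEventListener(type: string, listener: () => void): void;
}

// Hypothetical helper: cancel any current or queued utterances on page exit.
function stopSpeechOnUnload(target: EventTargetLike, synth: SynthLike): void {
  const stop = (): void => {
    if (synth.speaking || synth.pending) {
      synth.cancel();
    }
  };
  target.addEventListener('pagehide', stop);
  target.addEventListener('beforeunload', stop);
}

// Browser usage (commented out so the sketch also runs outside a browser):
// stopSpeechOnUnload(window, window.speechSynthesis);
```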

It's also not clear until you familiarize yourself with it that it will only read new incoming messages (and to a fault if you engage with the AI rapidly, as there will be a constant stream of audio if you keep messaging)

Also if you enable listening (STT) with TTS going, it's a weird time as STT will pick it up (I understand it's an edge case).

Rather than this being automatic, it might be better to add an optional volume icon on each message, where the regenerate, clipboard, and edit icons live (HoverButtons), that can be opted into in the settings menu. That way I can toggle TTS on a per-message basis and stop it with the same button (with some alternating UI flavor, maybe a volume icon with soundwaves to enable and one without soundwaves to disable).

An additional switch can be made for reading every new message, to keep the original functionality based on this setting, as I'm sure some would appreciate it.
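The per-message toggle could be modeled roughly as follows. This is a sketch under assumptions: `Speaker`, `MessageTts`, and `createMessageTts` are hypothetical names, and the commented adapter shows one way they might map onto the standard `SpeechSynthesisUtterance` and `speechSynthesis` APIs.

```typescript
// Thin abstraction over the speech engine so the logic runs outside a browser.
interface Speaker {
  speak(text: string, onEnd: () => void): void;
  cancel(): void;
}

interface MessageTts {
  toggle(messageId: string, text: string): boolean; // true if now speaking
  isSpeaking(messageId: string): boolean;           // drives the icon state
}

function createMessageTts(speaker: Speaker): MessageTts {
  let activeId: string | null = null;
  return {
    toggle(messageId, text) {
      if (activeId === messageId) {
        // Same button pressed again: stop reading this message.
        activeId = null;
        speaker.cancel();
        return false;
      }
      // Another message is being read: stop it before starting this one.
      if (activeId !== null) speaker.cancel();
      activeId = messageId;
      speaker.speak(text, () => {
        // Ignore stale end events from a previously cancelled message.
        if (activeId === messageId) activeId = null;
      });
      return true;
    },
    isSpeaking(messageId) {
      return activeId === messageId;
    },
  };
}

// Possible browser adapter (commented out so the sketch runs anywhere):
// const browserSpeaker: Speaker = {
//   speak(text, onEnd) {
//     const utterance = new SpeechSynthesisUtterance(text);
//     utterance.onend = onEnd;
//     window.speechSynthesis.speak(utterance);
//   },
//   cancel() { window.speechSynthesis.cancel(); },
// };
```

Only one message is read at a time, so rapid messaging would not build up a constant stream of audio unless the separate "read every new message" switch is enabled.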

I understand the intent for an automated experience, but it was a bit of a runaway experience for me. My inspiration for the volume toggling comes from Discord, where TTS is very deliberate and uses slash commands (/tts read this out loud), as well as from a few other apps I've experienced it in.

How do you feel about disabling TTS for now and reworking it whenever you get the chance in a separate PR? We can keep the files but disable the functionality/event listening for now.

Aug 15 '23 06:08 danny-avila

I agree on the TTS. It needs an icon to turn it off while it is speaking. Are the icons below good options? I am personally not a fan of the robot voice and would love a quality TTS engine such as Piper (https://github.com/rhasspy/piper), but I know some would want to use ElevenLabs. I will look into adding the preferences. Should they all be under the general tab or create a new one for speech? I am good with disabling TTS by removing the shortcut key that enables it.

TTS Off https://lucide.dev/icons/volume-x TTS On https://lucide.dev/icons/volume-2

Aug 15 '23 19:08 bsu3338

Should they all be under the general tab or create a new one for speech? I am good with disabling TTS by removing the shortcut key that enables it.

If they could be in a new one, I think that might be best.

TTS Off https://lucide.dev/icons/volume-x TTS On https://lucide.dev/icons/volume-2

Those icons look good!

Aug 15 '23 23:08 danny-avila

@danny-avila I think I made all the suggested changes, except for not adding a shortcut screen.

Switched to hotkeys-js for Shift+Alt+L to enable Listen
Recordings do not auto-submit but wait for the user to click submit. I left the code commented out for a user preference in a future PR.
Added Text-To-Speech icons that allow for reading a message or canceling the reading while it is in progress.
Disabled the ability to click the mic button after the key combination is pressed
Disabled the shortcut to auto Text-To-Speech
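For reference, the Shift+Alt+L check can also be written against a plain `keydown` event without a library. The sketch below is that library-free alternative, not the hotkeys-js binding used in this PR (with hotkeys-js the equivalent registration would be `hotkeys('shift+alt+l', handler)`). `KeyLike` and `isListenShortcut` are hypothetical names. It matches on `code` rather than `key`, because with Alt held some platforms report a different character in `key`.

```typescript
// Subset of KeyboardEvent that the check needs; lets the sketch run anywhere.
interface KeyLike {
  code: string;
  shiftKey: boolean;
  altKey: boolean;
  ctrlKey: boolean;
  metaKey: boolean;
}

// Hypothetical helper: true only for Shift+Alt+L with no other modifiers.
function isListenShortcut(event: KeyLike): boolean {
  return (
    event.code === 'KeyL' &&
    event.shiftKey &&
    event.altKey &&
    !event.ctrlKey &&
    !event.metaKey
  );
}

// Browser usage (commented out so the sketch also runs outside a browser):
// window.addEventListener('keydown', (event) => {
//   if (isListenShortcut(event)) {
//     event.preventDefault();
//     startListening(); // hypothetical function that begins recognition
//   }
// });
```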

The future plan would be to add speech settings to the user settings.

NOTICE: I just did some testing, and it seems my hotkey is not working to enable recording. Switching to hotkeys was the last change I made, so I assume that is the source of the issue. My original testing must have been using a cached version. Converting it back to draft.

Sep 04 '23 04:09 bsu3338

@danny-avila I hope it will be up soon

Nov 20 '23 13:11 mmw1984

Any updates on getting this merged?

Jan 23 '24 21:01 luandro

Any updates on getting this merged?

I'm working on it in #1603

Jan 23 '24 21:01 berry-13

Closing in favor of #1603

Jan 23 '24 22:01 danny-avila

Any updates on this?

I'm working on this in #1603

That's great. I found that LibreChat has seen a recent surge in commits, but there are very few user-expected features.

TTS will be highly anticipated!

Jan 25 '24 13:01 kuangxiaoye

Any updates on this?

I'm working on this in #1603

That's great. I found that LibreChat has seen a recent surge in commits, but there are very few user-expected features.

TTS will be highly anticipated!

This is partly because I've been wrapped up with finishing this update: https://x.com/lgtm_hbu/status/1749865914315530361?s=20

Once I finish this, I can focus on more general user features for the app.

Jan 25 '24 13:01 danny-avila

Any updates on this?

I'm working on this in #1603

That's great. I found that LibreChat has seen a recent surge in commits, but there are very few user-expected features. TTS will be highly anticipated!

This is partly because I've been wrapped up with finishing this update: https://x.com/lgtm_hbu/status/1749865914315530361?s=20

Once I finish this, I can focus on more general user features for the app.

Exciting!

Jan 26 '24 00:01 kuangxiaoye
