
Feature request (Accessibility): Optional Speech-to-Text (STT) integration


Adding speech-to-text to Galène would greatly help people who have trouble hearing. The Whisper and Vosk models can be self-hosted and run on low-resource machines, which could pair well with Galène. livestream.sh is an example of near-real-time transcription with Whisper using only the CPU.

I recognize that this could be a difficult item, but I do want to put it out there as it could have a great impact.

TechnologyClassroom · May 14 '24 14:05

I'm open to the idea, but I'd need to speak with the people who actually need the feature. In particular, I'd need to understand why they don't use a system-wide speech-to-text system.

I have spoken to visually impaired users of Galene, and they tell me that they use a system-wide screen reader and therefore don't need TTS support in Galene itself; they just need the Galene UI to be accessible (which is apparently the case). Before implementing the feature you request, I need to understand whether hearing-impaired users use a system-wide speech-to-text system and, if they don't, why.

If the issue is that there are no good speech-to-text systems for free OSes, then in my opinion we should work on building one, rather than adding speech-to-text support to every single application.

jech · May 15 '24 12:05

Those are good questions.

The technology exists today for free desktop OSes, but it is still within developer skill-set range rather than user-friendly. The above script could be run in a local terminal on an old laptop and connected to the desktop audio instead of the microphone to get a local live transcription in near real time. The terminal would need to stay always on top and take up enough screen real estate to be useful. Setting up local Whisper models takes some command-line experience, which not everyone has. There is definitely work that could be done to make this process easier, such as GUIs, packaging, and installers. On the mobile front, things are still at a very early stage, and processing power could be an issue.

If you run an event where attendees may or may not have hearing issues and you supply all of the technology yourself, the local Whisper system would need to be configured on every desktop machine, and someone would need to explain how to start it if and when it is needed. Per-machine configuration scales poorly in this scenario.

Jitsi Meet with Jigasi adds optional transcription, followed by optional translation through LibreTranslate. Transcription would be the first step towards translation.

If the event organizer could get STT working once on the conferencing system, then all users could benefit, whether they need it for hearing, prefer subtitles, or are not native speakers of the language. The STT output could be integrated into the chat system or presented in some other intuitive way that does not leave users switching between two windows, juggling window sizes to follow the chat, waiting for a model to download before they can participate, or unable to participate on their mobile device.

TechnologyClassroom · May 15 '24 13:05

Ah-ha, you're thinking of server-side STT. Yes, that makes more sense.

I think this could be done by writing a separate client that connects to the Galene server, performs STT, and then publishes the resulting text in the chat. This client could run on any computer, which would avoid putting CPU-intensive work on the Galene server itself.

Please don't hold your breath.

jech · May 16 '24 16:05

I've got a very early prototype. On my laptop, it takes 290% CPU and on the order of 500 MB of RAM to transcribe a single stream in real time. That's using 2 s segments and the ggml-base.en.bin model.

jech · Jul 29 '24 16:07

I've done some more experimenting. The smallest available model, "tiny-q5_1", runs in real time on my laptop, but the quality is not useful (it mostly produces hallucinations). The base.en model almost runs in real time (it drops runs of packets) and occasionally produces useful output.

Unfortunately, whisper.cpp is not able to produce a transcript incrementally, so I'm chopping the audio into two-second segments, which I then pass to Whisper. Better results could probably be achieved with a smarter segmentation strategy, or with an STT engine that can transcribe incrementally. See also https://github.com/ggerganov/whisper.cpp/issues/1976.

In order to continue working on this, I would need ssh access to a machine much more powerful than the ones I currently have access to.

jech · Jul 29 '24 19:07

Increasing the segment size to 5 s improves the results quite a bit, but a better segmentation strategy is still needed.

jech · Jul 29 '24 20:07

After some more tweaking, it's sort of usable. Please see https://github.com/jech/galene-stt

jech · Jul 29 '24 22:07