SwiftWhisper

Real-time transcription

Open lucabeetz opened this issue 2 years ago • 66 comments
Hey, awesome package!

I wanted to ask how one could use this for on-device real-time transcription with microphone audio, similar to the objc example from the whisper.cpp package.

lucabeetz avatar Apr 07 '23 10:04 lucabeetz

I'd also like to see an SFSpeechRecognizer-like API for easy replacement of SFSpeechRecognizer.

cerupcat avatar Apr 07 '23 20:04 cerupcat

+1 for this feature

libratiger avatar Apr 11 '23 00:04 libratiger

Yes, this would be a great feature.

fakerybakery avatar Apr 15 '23 22:04 fakerybakery

Right now, I'm setting a timer to start and stop the transcription every 2 seconds. However, it's not that accurate: if a word is cut off at a chunk boundary, Whisper tries to improvise, and the text often contains hallucinations.

fakerybakery avatar Apr 16 '23 22:04 fakerybakery

How would Whisper officially support real time? The cutoff issue is the same for the official library, correct? @fakerybakery

jacobjiangwei avatar Apr 23 '23 11:04 jacobjiangwei

How would Whisper officially support real time? The cutoff issue is the same for the official library, correct? @fakerybakery

The whisper.cpp repo has examples of how to implement real-time transcription.

cerupcat avatar Apr 23 '23 18:04 cerupcat

I think that the whisper.cpp library stores some of the previous recording history and uses that to fix the cutoff issue, but I'm not sure.
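A rough sketch of that idea in plain Swift, in case it helps the discussion: each new chunk is prefixed with the tail of the previous window, so a word cut at a boundary is heard again in full. `OverlappingChunker`, `chunkSize`, and `keepSize` are made-up names for illustration, and this is not a port of whisper.cpp's actual logic.

```swift
import Foundation

/// Hypothetical helper: prepends the tail of the previous window to each new
/// chunk so that audio around a chunk boundary is transcribed twice.
struct OverlappingChunker {
    let chunkSize: Int   // number of new samples per window
    let keepSize: Int    // number of samples carried over from the previous window
    private var pending: [Float] = []
    private var carry: [Float] = []

    init(chunkSize: Int, keepSize: Int) {
        precondition(chunkSize > 0 && keepSize >= 0 && keepSize <= chunkSize)
        self.chunkSize = chunkSize
        self.keepSize = keepSize
    }

    /// Feed newly captured samples; returns zero or more windows ready to transcribe.
    mutating func append(_ samples: [Float]) -> [[Float]] {
        pending.append(contentsOf: samples)
        var windows: [[Float]] = []
        while pending.count >= chunkSize {
            let fresh = Array(pending.prefix(chunkSize))
            pending.removeFirst(chunkSize)
            let window = carry + fresh
            windows.append(window)
            // Remember the end of this window for the next one.
            carry = Array(window.suffix(keepSize))
        }
        return windows
    }
}
```

With 16 kHz audio, a 2-second chunk with 200 ms of overlap would be `OverlappingChunker(chunkSize: 32_000, keepSize: 3_200)`. The overlapping region is transcribed twice, so the resulting text may need de-duplicating at the seams.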

fakerybakery avatar Apr 23 '23 18:04 fakerybakery

How would Whisper officially support real time? The cutoff issue is the same for the official library, correct? @fakerybakery

The whisper.cpp repo has examples of how to implement real-time transcription.

Thanks for pointing that out. Just curious, why is it in Obj-C and not in a Swift version?

jacobjiangwei avatar Apr 26 '23 02:04 jacobjiangwei

I don't know why, but if someone could port the example to Swift, I would really appreciate that (I'm really bad at Obj-C).

fakerybakery avatar Apr 26 '23 20:04 fakerybakery

I think that the whisper.cpp library stores some of the previous recording history and uses that to fix the cutoff issue, but I'm not sure.

Yep, I believe it does too – see this line (and line 245)

brytonsf avatar May 07 '23 23:05 brytonsf

I don't have a great understanding, but to me it looks like whisper.objc stores the contents of a buffer when it fills up, then calls its transcribe function on what it just stored, while clearing the buffer and re-enqueuing it. I don't know a ton about AVFAudio, but does anyone know if you could use AVAudioEngine and AVAudioPCMBuffer to create similar functionality? I'm thinking you could call Whisper.transcribe here with the buffer data, if you can get that buffer data back from AVAudioEngine. Does anyone know if that would work?
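For what it's worth, a minimal sketch of that approach might look like the following. `ChunkAccumulator` and `startCapture` are hypothetical names, not part of SwiftWhisper or whisper.objc. The tap uses the input node's own format, which is usually 44.1 or 48 kHz, so the samples would still need converting to 16 kHz mono (for example with AVAudioConverter, omitted here) before being handed to something like SwiftWhisper's `transcribe(audioFrames:)`. Microphone permission and, on iOS, audio session setup are also left out.

```swift
import Foundation
#if canImport(AVFoundation)
import AVFoundation
#endif

/// Hypothetical helper: collects samples and hands out fixed-size chunks.
final class ChunkAccumulator {
    private var samples: [Float] = []
    private let chunkSize: Int
    private let onChunk: ([Float]) -> Void

    init(chunkSize: Int, onChunk: @escaping ([Float]) -> Void) {
        precondition(chunkSize > 0)
        self.chunkSize = chunkSize
        self.onChunk = onChunk
    }

    /// Append new samples and emit every complete chunk, oldest first.
    func append(_ newSamples: [Float]) {
        samples.append(contentsOf: newSamples)
        while samples.count >= chunkSize {
            onChunk(Array(samples.prefix(chunkSize)))
            samples.removeFirst(chunkSize)
        }
    }
}

#if canImport(AVFoundation)
/// Starts the microphone and forwards the first channel to the accumulator.
/// Assumption: the accumulator is only touched from the tap callback.
func startCapture(engine: AVAudioEngine, into accumulator: ChunkAccumulator) throws {
    let input = engine.inputNode
    let format = input.outputFormat(forBus: 0)   // hardware format, not 16 kHz
    input.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, _ in
        guard let channels = buffer.floatChannelData else { return }
        let frameCount = Int(buffer.frameLength)
        let mono = Array(UnsafeBufferPointer(start: channels[0], count: frameCount))
        accumulator.append(mono)
    }
    try engine.start()
}
#endif
```

The `onChunk` closure is where the resampling and the transcription call would go. Keeping the accumulator separate from the audio engine makes it possible to test the chunking without a microphone.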

barkb avatar May 11 '23 21:05 barkb

@barkb have you ever found a solution to this real-time idea?

ldenoue avatar Aug 16 '23 09:08 ldenoue

+1

moaljazaery avatar Aug 18 '23 05:08 moaljazaery

I found this Swift implementation of streaming: https://github.com/leetcode-mafia/cheetah/blob/b7e301c0ae16df5c597b564b2126e10e532871b2/LibWhisper/stream.cpp with a Swift file inside a Swift project. It's CC0 licensed.

I couldn't tell if it uses the right config to benefit from the latest Metal/Core ML performance-oriented options, and it uses some tool that requires a brew install, so I don't know how sandbox-friendly it is.

aehlke avatar Sep 15 '23 23:09 aehlke

I found this Swift implementation of streaming: leetcode-mafia/cheetah@b7e301c/LibWhisper/stream.cpp with a Swift file inside a Swift project. It's CC0 licensed.

I couldn't tell if it uses the right config to benefit from the latest Metal/Core ML performance-oriented options, and it uses some tool that requires a brew install, so I don't know how sandbox-friendly it is.

@aehlke

The linked app is an AI interview ... er assistant? It listens to your audio and tries to respond with GPT-4 (it doesn't use SwiftWhisper). It uses the SDL2 library, which, according to their website:

... provide low level access to audio, keyboard, mouse, joystick, and graphics hardware via OpenGL and Direct3D ...

I haven't extensively researched this subject, but my interpretation is that this allows the app to listen to your system audio and transcribe it, so you don't have to install external software such as BlackHole. This leads me to believe that the library may not be necessary if the objective is to listen to the microphone, which may mean that it can run on other platforms, such as iOS.

fakerybakery avatar Sep 16 '23 20:09 fakerybakery

@fakerybakery it looks to me like https://github.com/leetcode-mafia/cheetah/blob/b7e301c0ae16df5c597b564b2126e10e532871b2/LibWhisper/WhisperStream.swift has similarities to https://github.com/exPHAT/SwiftWhisper/blob/master/Sources/SwiftWhisper/Whisper.swift and the latter could be extended with that logic with some effort

aehlke avatar Sep 17 '23 18:09 aehlke

I've ported it into SwiftWhisper here: https://github.com/dougzilla32/SwiftWhisper/compare/master...lake-of-fire:SwiftWhisper:master#diff-bc90b919aba349b74638614ff99f2c0581ae2bcd8b4c2c816a9c9d93969853d0 (still untested, though). Looks like SDL can run on iOS.

aehlke avatar Sep 18 '23 14:09 aehlke

Wow, thank you so much! Might it be possible to update the README to add documentation? Also, are you planning to make a PR to merge this into the main repository?

fakerybakery avatar Sep 18 '23 16:09 fakerybakery

No plans, but I'll update here if I test it and it works

aehlke avatar Sep 18 '23 19:09 aehlke

Hi @aehlke, were you able to get it to work?

fakerybakery avatar Sep 23 '23 18:09 fakerybakery

Haven't tried yet. I will within a week or two probably

aehlke avatar Sep 23 '23 22:09 aehlke

I have created a very poor man's version of the streaming here. It works, but the reading from the buffer queue needs quite a bit of improvement.

cgfarmer4 avatar Oct 04 '23 03:10 cgfarmer4

What's the downside to your queue implementation? What's the cost or risk of the technical debt as you implemented it? Thanks.

aehlke avatar Oct 08 '23 20:10 aehlke

@aehlke Lost fidelity. If you test using ggerganov's implementation with AudioQueue, it's a bit more accurate. I would say this implementation is like 90% good enough, though.

I haven't had time to invest in making it more of a true buffer, where dropped audio gets put back into the array; this is more of a FILO queue.
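For comparison, a "true buffer" in the FIFO sense could be sketched roughly like this: a fixed-capacity ring buffer where samples are read back in the order they were captured, and only the oldest audio is discarded when the reader falls behind. `SampleRingBuffer` is a hypothetical name, and this is an illustration rather than the code from either implementation mentioned above.

```swift
import Foundation

/// Hypothetical fixed-capacity FIFO of audio samples.
/// Reads return the oldest samples first; writes overwrite the oldest when full.
struct SampleRingBuffer {
    private var storage: [Float]
    private var head = 0          // index of the oldest stored sample
    private(set) var count = 0    // number of samples currently stored

    init(capacity: Int) {
        precondition(capacity > 0)
        storage = [Float](repeating: 0, count: capacity)
    }

    var capacity: Int { storage.count }

    /// Append samples in capture order, dropping the oldest ones on overflow.
    mutating func write(_ samples: [Float]) {
        for sample in samples {
            let tail = (head + count) % capacity
            storage[tail] = sample
            if count < capacity {
                count += 1
            } else {
                head = (head + 1) % capacity   // buffer full: oldest sample is lost
            }
        }
    }

    /// Remove and return up to `n` of the oldest samples.
    mutating func read(_ n: Int) -> [Float] {
        let take = min(max(n, 0), count)
        var out: [Float] = []
        out.reserveCapacity(take)
        for offset in 0..<take {
            out.append(storage[(head + offset) % capacity])
        }
        head = (head + take) % capacity
        count -= take
        return out
    }
}
```

This sketch is not thread-safe. If the audio callback writes while another thread reads, access would need to be serialized, for example with a lock or a serial queue.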

cgfarmer4 avatar Oct 08 '23 20:10 cgfarmer4

I tested and fixed the one I linked above. I don't have a test implementation to share but it works.

aehlke avatar Oct 09 '23 01:10 aehlke

@aehlke mind sharing a code example?

cgfarmer4 avatar Oct 09 '23 02:10 cgfarmer4

cheetah-main-2.zip: here's the "Cheetah" project I linked above, locally forked to use my SwiftWhisper fork with the added SwiftWhisperStream module. I disabled most of the functions of the app. All that remains is a demo of it downloading the medium model and then showing text results as you speak; ignore the other buttons.

aehlke avatar Oct 09 '23 12:10 aehlke

@aehlke this is pretty amazing. Are you using a more recent version of the code? When I try to add SwiftWhisper as a dependency from github.com/lake-of-fire/SwiftWhisper.git, I get an error that SwiftWhisperStream cannot be found.

eni9889 avatar Oct 09 '23 14:10 eni9889

https://github.com/lake-of-fire/SwiftWhisper/blob/master/Package.swift#L20 it's here...

Btw, this appears to work on both iOS and macOS, though I only really tested macOS. The dependencies involved are all properly open-licensed, e.g. MIT, no GPL.

My SwiftWhisper fork is messy and could be simplified for sure, either merged into SwiftWhisper or split out as a separate thing.

aehlke avatar Oct 09 '23 14:10 aehlke

@aehlke my mistake, looks like I didn't actually add it to the target. Amazing work.
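For anyone hitting the same error from a Package.swift manifest rather than the Xcode UI, a sketch of the fix might look like this. It assumes the fork exposes library products named `SwiftWhisper` and `SwiftWhisperStream` on its `master` branch; `MyApp` is a placeholder name. Declaring the package alone is not enough, since each product has to be listed in the target's dependencies too.

```swift
// swift-tools-version:5.7
import PackageDescription

let package = Package(
    name: "MyApp",  // placeholder name
    // A `platforms:` entry may be needed to match the fork's minimum versions.
    dependencies: [
        // Fork mentioned in this thread; the branch name is taken from the linked URL.
        .package(url: "https://github.com/lake-of-fire/SwiftWhisper.git", branch: "master")
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                // Both products must be added to the target explicitly.
                .product(name: "SwiftWhisper", package: "SwiftWhisper"),
                .product(name: "SwiftWhisperStream", package: "SwiftWhisper")
            ]
        )
    ]
)
```

In an Xcode project, the equivalent step is adding the SwiftWhisperStream product to the app target's linked frameworks and libraries.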

eni9889 avatar Oct 09 '23 14:10 eni9889