Transcribe apps audio

Open thewh1teagle opened this issue 10 months ago • 3 comments

Goal

Transcribe system audio and/or the microphone (either one or both) and preview the transcription in real time.

Research

It may be possible to follow the approach used in https://github.com/CapSoftware/Cap

Useful Rust crate: https://github.com/helmerapp/scap

Possibly, per platform:

macOS: https://github.com/svtlabs/screencapturekit-rs (ScreenCaptureKit)

Windows: Windows.Graphics.Capture (https://github.com/NiiightmareXD/windows-capture)

A macOS app that captures system audio using the ScreenCaptureKit API: https://github.com/Mnpn/Azayaka

Microsoft Answers thread on how Audacity manages to record audio from the speakers (TLDR: Windows WASAPI): https://answers.microsoft.com/en-us/windows/forum/all/how-record-speaker-output-windows-10/251bb695-5170-4a35-a90f-42d9f6f3345a

macOS sample: https://gist.github.com/thewh1teagle/d02415b9768fd816a780f9af6a3f2bdb

Some platforms, though not all, provide virtual channels for monitoring (PulseAudio and PipeWire on Linux, WASAPI on Windows, Core Audio on macOS), and cpal does not expose them (not sure about Core Audio, actually; it might have been disabled or removed for security reasons).

Loopback was added to cpal in https://github.com/RustAudio/cpal/pull/478 (works on Windows).

Additional question: how to get system audio and the microphone at the same time into a single stream on Linux?

TLDR

The Rust crate cpal provides a way to get an audio stream from the microphone(s). On Windows it also provides an audio stream from the default output device (system audio). On macOS we should use screencapturekit-rs and provide a stream equivalent to a cpal stream.

If two streams are used, mix them by adding them together (simple addition of the sample values works), then push the result to whisper in a loop. Mixing can introduce synchronization issues (if the streams come from two different sound cards, etc.); RtAudio handles that better and can be used through rtaudio-rs. whisper.cpp expects single-channel (mono) audio at a 16 kHz sample rate with 16-bit samples, so resampling is probably needed. Stereo is converted to mono by taking the mean of both channels.
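A rough sketch of the mixing, downmixing and resampling steps described above, using only the standard library. All function names here are made up for illustration and are not part of cpal, rtaudio-rs or whisper.cpp. It assumes `f32` samples in the range -1.0 to 1.0 and interleaved stereo, and the resampler is a naive linear interpolation with no anti-aliasing filter, so a proper resampling crate would likely be a better fit in practice.

```rust
/// Average interleaved stereo frames (L, R, L, R, ...) into mono.
fn stereo_to_mono(interleaved: &[f32]) -> Vec<f32> {
    interleaved
        .chunks_exact(2)
        .map(|frame| (frame[0] + frame[1]) * 0.5)
        .collect()
}

/// Mix two mono streams by adding samples, clamping to [-1.0, 1.0].
/// The shorter stream is treated as silence past its end.
fn mix(a: &[f32], b: &[f32]) -> Vec<f32> {
    let len = a.len().max(b.len());
    (0..len)
        .map(|i| {
            let x = a.get(i).copied().unwrap_or(0.0);
            let y = b.get(i).copied().unwrap_or(0.0);
            (x + y).clamp(-1.0, 1.0)
        })
        .collect()
}

/// Naive linear-interpolation resampler (assumption: good enough for a
/// prototype; it does not low-pass filter before downsampling).
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == 0 || to_hz == 0 {
        return Vec::new();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            let s0 = input[idx.min(last)];
            let s1 = input[(idx + 1).min(last)];
            s0 + (s1 - s0) * frac
        })
        .collect()
}

/// Convert f32 samples in [-1.0, 1.0] to signed 16-bit PCM.
fn to_i16(samples: &[f32]) -> Vec<i16> {
    samples
        .iter()
        .map(|s| (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16)
        .collect()
}
```

A typical order would be: downmix each source to mono, resample each to 16 kHz, mix, then convert to 16-bit if the consumer needs integer PCM. Clamping after addition avoids wrap-around but can still distort loud passages, so scaling each source down before mixing may be preferable.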

Simple approach

Record from the speakers and mic concurrently and write to a file every 5-10 seconds, at the best silent position. Push the paths to a queue (each item will be one or two paths). Another task iterates over the queue, merges if needed, and transcribes.
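The queue described above could be sketched with a standard-library channel and a worker thread. This is only an outline under assumptions: `Chunk`, `process` and `run_pipeline` are hypothetical names, and `process` merely returns a description of what it would do instead of actually merging files or calling whisper.

```rust
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

/// One queue item: a recorded chunk from one source or from both.
#[derive(Debug)]
enum Chunk {
    Single(PathBuf),
    Pair { speakers: PathBuf, mic: PathBuf },
}

/// Hypothetical stand-in for "merge if needed, then transcribe".
/// A real implementation would mix the two files and run whisper here.
fn process(chunk: &Chunk) -> String {
    match chunk {
        Chunk::Single(p) => format!("transcribe {}", p.display()),
        Chunk::Pair { speakers, mic } => format!(
            "merge {} + {}, then transcribe",
            speakers.display(),
            mic.display()
        ),
    }
}

/// Feed chunks to a worker thread through a channel and collect the
/// results in the order the chunks were queued.
fn run_pipeline(chunks: Vec<Chunk>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<Chunk>();

    // Consumer: runs until every sender has been dropped.
    let worker = thread::spawn(move || {
        rx.iter().map(|c| process(&c)).collect::<Vec<String>>()
    });

    // Producer: in the real app this would be the recording task.
    for chunk in chunks {
        tx.send(chunk).expect("worker hung up");
    }
    drop(tx); // close the queue so the worker can finish

    worker.join().expect("worker panicked")
}
```

Keeping the recorder and the transcriber on separate threads means a slow transcription only grows the queue rather than stalling the recording.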

https://github.com/ggerganov/whisper.cpp/tree/master/examples/stream#sliding-window-mode-with-vad
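For the "best silent position" in the simple approach, one crude option is to pick the lowest-energy window in the tail of the buffer and cut there. The sketch below is not the VAD used by the whisper.cpp stream example; it is a plain mean-square energy search, and the function name and parameters are assumptions for illustration.

```rust
/// Return the start index of the quietest `window`-sample span within
/// `samples[search_from..]`, judged by mean squared amplitude.
/// Windows are tried at half-window steps; the earliest minimum wins.
/// Returns None if no full window fits.
fn quietest_window_start(samples: &[f32], search_from: usize, window: usize) -> Option<usize> {
    if window == 0 || search_from + window > samples.len() {
        return None;
    }
    let step = (window / 2).max(1);
    let mut best_start = search_from;
    let mut best_energy = f32::INFINITY;
    let mut start = search_from;
    while start + window <= samples.len() {
        let energy = samples[start..start + window]
            .iter()
            .map(|s| s * s)
            .sum::<f32>()
            / window as f32;
        if energy < best_energy {
            best_energy = energy;
            best_start = start;
        }
        start += step;
    }
    Some(best_start)
}
```

With 16 kHz audio, a window of about 1600 samples (100 ms) searched over the last couple of seconds of a chunk would be a reasonable starting point to experiment with.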

thewh1teagle, Apr 17 '24 00:04