AVFoundation AVAudioNode installTap audio buffer format for whisper.cpp

Open buholzer opened this issue 1 year ago • 1 comments

Any hint on how I need to convert the buffer from AVAudioNode installTap so it is compatible with whisper.cpp?

https://developer.apple.com/documentation/avfaudio/avaudionode/1387122-installtap

I tried simply converting it to an array:

Array(UnsafeBufferPointer(start: buffer.floatChannelData![0], count: Int(buffer.frameLength)))

I also tried using an AVAudioConverter:

let recordingFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: Double(16000.0), channels: 1, interleaved: true)
let formatConverter = AVAudioConverter(from: inputFormat, to: recordingFormat!)

...

formatConverter.convert(to: pcmBuffer!, error: &error, withInputFrom: inputBlock)
if let channelData = pcmBuffer!.floatChannelData {
    let channelDataValue = channelData.pointee
    let channelDataValueArray = stride(from: 0,
                                       to: Int(pcmBuffer!.frameLength),
                                       by: buffer.stride).map { channelDataValue[$0] }
}

But both approaches only produced [BLANK_AUDIO], (sighs), or (wind_noise) as output.

I am trying to create a stream/realtime implementation with Swift.

buholzer avatar Mar 30 '24 20:03 buholzer

I had the same problem when I built this real-time demo, whisper.swiftui.

whisper.cpp cannot use the original pcmBuffer as is; it needs to be converted first.

You can take a look at my implementation.

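import AVFoundation

// Flattens an AVAudioPCMBuffer into an interleaved [Float] array clamped to [-1, 1].
func decodePCMBuffer(_ buffer: AVAudioPCMBuffer) throws -> [Float] {
    // floatChannelData is nil unless the buffer holds 32-bit float samples.
    guard let floatChannelData = buffer.floatChannelData else {
        throw NSError(domain: "Invalid PCM Buffer", code: 0, userInfo: nil)
    }

    let channelCount = Int(buffer.format.channelCount)
    let frameLength = Int(buffer.frameLength)
    // stride is 1 for deinterleaved buffers and channelCount for interleaved ones.
    let sampleStride = buffer.stride

    var floats = [Float]()
    floats.reserveCapacity(frameLength * channelCount)

    for frame in 0..<frameLength {
        for channel in 0..<channelCount {
            // Each channel pointer addresses frameLength samples spaced by stride,
            // so the frame index must not be multiplied by channelCount again.
            let floatSample = floatChannelData[channel][frame * sampleStride]
            floats.append(max(-1.0, min(floatSample, 1.0)))
        }
    }

    return floats
}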
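
whisper.cpp expects 16 kHz mono 32-bit float samples, while an input tap usually delivers the hardware format (often 44.1 or 48 kHz, sometimes stereo), so a plain [Float] copy may still need downmixing and resampling. Below is a minimal sketch of those two steps in plain Swift. The helper names downmixToMono and resampleLinear are hypothetical, and linear interpolation is only a simple stand-in for AVAudioConverter: it applies no low-pass filter, so downsampling with it can introduce aliasing.

```swift
import Foundation

/// Hypothetical helper: averages interleaved channels into one mono channel.
func downmixToMono(_ interleaved: [Float], channelCount: Int) -> [Float] {
    // Mono (or invalid channel counts) is returned unchanged.
    guard channelCount > 1 else { return interleaved }
    let frameCount = interleaved.count / channelCount
    var mono = [Float](repeating: 0, count: frameCount)
    for frame in 0..<frameCount {
        var sum: Float = 0
        for channel in 0..<channelCount {
            sum += interleaved[frame * channelCount + channel]
        }
        mono[frame] = sum / Float(channelCount)
    }
    return mono
}

/// Hypothetical helper: resamples mono audio by linear interpolation.
/// The 16 kHz default matches the sample rate whisper.cpp works with.
func resampleLinear(_ input: [Float],
                    from sourceRate: Double,
                    to targetRate: Double = 16_000) -> [Float] {
    guard !input.isEmpty, sourceRate > 0, targetRate > 0 else { return [] }
    if sourceRate == targetRate { return input }

    let outputCount = Int(Double(input.count) * targetRate / sourceRate)
    let step = sourceRate / targetRate  // input samples advanced per output sample
    var output = [Float]()
    output.reserveCapacity(outputCount)

    for i in 0..<outputCount {
        let position = Double(i) * step
        let lower = Int(position)
        let upper = min(lower + 1, input.count - 1)  // clamp at the last sample
        let fraction = Float(position - Double(lower))
        output.append(input[lower] * (1 - fraction) + input[upper] * fraction)
    }
    return output
}
```

With an interleaved array such as the one decodePCMBuffer returns, this could be used as resampleLinear(downmixToMono(samples, channelCount: channelCount), from: buffer.format.sampleRate) before handing the result to whisper.cpp.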

mpr0xy avatar Apr 07 '24 08:04 mpr0xy