WhisperKit
Want to use AVCaptureSession buffers instead of AVAudioEngine
Hey there!
First off, thanks so much for building this awesome library! It's a total pleasure to use and works great. Looking forward to the Metal update. In the meantime, I was curious whether you would accept a PR that allows AVCaptureSession to be used in the AudioProcessor class instead of AVAudioEngine.
I was thinking of adding a way to pass in a new setupEngine function so that the captureOutput delegate can be used in place of the installTap function. The reason I want to do this is that it makes it easier to change the microphone in-app instead of relying on the system default.
- Would it make sense to allow for this in the AudioProcessor? If so, I'm happy to come up with a clean interface proposal.
- If not, is there perhaps a way to override the AudioProcessor class and provide an alternate setupEngine function? (A rough sketch of the capture-session idea is below.)
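For illustration, here's a very rough sketch of the capture-session idea; CaptureAudioSource and onSampleBuffer are made-up names, not anything in WhisperKit:
import AVFoundation

// Sketch only: the captureOutput(_:didOutput:from:) delegate callback would take the
// place of the installTap callback as the source of audio buffers.
final class CaptureAudioSource: NSObject, AVCaptureAudioDataOutputSampleBufferDelegate {
    private let session = AVCaptureSession()
    private let audioOutput = AVCaptureAudioDataOutput()
    private let queue = DispatchQueue(label: "capture.audio.queue")

    // Each delivered CMSampleBuffer would be converted to [Float] and appended to
    // audioSamples, mirroring what the AVAudioEngine tap does today.
    var onSampleBuffer: ((CMSampleBuffer) -> Void)?

    func start(with device: AVCaptureDevice) throws {
        session.beginConfiguration()
        let input = try AVCaptureDeviceInput(device: device)
        if session.canAddInput(input) { session.addInput(input) }
        audioOutput.setSampleBufferDelegate(self, queue: queue)
        if session.canAddOutput(audioOutput) { session.addOutput(audioOutput) }
        session.commitConfiguration()
        session.startRunning()
    }

    func stop() {
        session.stopRunning()
    }

    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        onSampleBuffer?(sampleBuffer)
    }
}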
Thanks for the note @cgfarmer4! @ZachNagengast what do you think?
Ah, I just found this code ;) // TODO: implement selecting input device
Decided against using AVCaptureSession and instead just change the device using Core Audio. This seems to work for the MacBook and Continuity microphone, but I haven't figured out why it doesn't work for my external audio interface yet. Would you be open to this approach if I can figure out why it fails for the external interface?
- New assignMicrophoneInput function:
// Points the input node's underlying audio unit at a specific Core Audio device.
func assignMicrophoneInput(inputNode: AVAudioInputNode, inputDeviceID: AudioDeviceID) {
    guard let audioUnit = inputNode.audioUnit else {
        Logging.error("Failed to access the audio unit of the input node.")
        return
    }

    var inputDeviceID = inputDeviceID
    let error = AudioUnitSetProperty(
        audioUnit,
        kAudioOutputUnitProperty_CurrentDevice,
        kAudioUnitScope_Global,
        0,
        &inputDeviceID,
        UInt32(MemoryLayout<AudioDeviceID>.size)
    )

    if error != noErr {
        Logging.error("Error setting Audio Unit property: \(error)")
    } else {
        Logging.info("Successfully set input device.")
    }
}
- Updated setupEngine:
func setupEngine(inputDeviceID: AudioDeviceID? = nil) throws -> AVAudioEngine {
    let audioEngine = AVAudioEngine()
    let inputNode = audioEngine.inputNode
    let inputFormat = inputNode.outputFormat(forBus: 0)

    if let inputDeviceID = inputDeviceID {
        assignMicrophoneInput(inputNode: inputNode, inputDeviceID: inputDeviceID)
    }

    // ... existing tap installation and engine start continue here, unchanged ...
    return audioEngine
}
- Updated the start recording function to allow passing in the AudioDeviceID:
func startRecordingLive(inputDeviceID: AudioDeviceID? = nil, callback: (([Float]) -> Void)? = nil) throws {
    audioSamples = []
    audioEnergy = []

    audioEngine = try setupEngine(inputDeviceID: inputDeviceID)

    // Set the callback
    audioBufferCallback = callback
}
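For reference, a minimal sketch of how the new parameter could be called; the AudioProcessor initializer is the existing one, but the inputDeviceID parameter is the proposal above and the device ID value is made up:
// Sketch only: the device ID here is a placeholder; in practice it would come from
// Core Audio device enumeration (or the AVCaptureDevice mapping further down).
let audioProcessor = AudioProcessor()
let externalMicID: AudioDeviceID = 59 // placeholder, not a real device ID
do {
    try audioProcessor.startRecordingLive(inputDeviceID: externalMicID) { samples in
        print("Received \(samples.count) new samples")
    }
} catch {
    print("Failed to start recording: \(error)")
}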
Going to see if I can try some tactics from this thread for my interface, but it seems hacky.
@cgfarmer4 thanks for the effort looking into this. This looks promising, although I would also support an additional method that uses AVCaptureSession to generate audioSamples, in case some folks already have easy access to their app's AVCaptureDevice. There is nothing specifically tied to AVAudioEngine in the protocol; we'd just need to make sure it has handling for the various platforms that don't have access to those APIs (watchOS, for example, doesn't support it). Curious to see how your tests go, and I'd be happy to integrate these back into the AudioProcessor depending on the results.
Decided against using AVCaptureSession, since there's quite a bit of buffer conversion involved that likely adds latency (loosely held hypothesis). This meets my needs, because I can take the AVCaptureSession-selected AVCaptureDevice and get the AudioDeviceID from it on macOS. AVCaptureSession would give us a list of devices on other OSes, but what it won't do is allow AVAudioEngine to have its audioUnit changed. In order to get a more comprehensive list of devices on other OSes, we'd need to figure out the buffer conversion mechanism and keep it fast enough going from CMSampleBuffer to AVAudioPCMBuffer.
https://github.com/argmaxinc/WhisperKit/pull/51
static func getAudioDeviceID(for captureDevice: AVCaptureDevice) -> AudioDeviceID? {
    // Fetch the list of all Core Audio devices on the system.
    var propertySize: UInt32 = 0
    var address = AudioObjectPropertyAddress(
        mSelector: kAudioHardwarePropertyDevices,
        mScope: kAudioObjectPropertyScopeGlobal,
        mElement: kAudioObjectPropertyElementMain
    )
    AudioObjectGetPropertyDataSize(AudioObjectID(kAudioObjectSystemObject), &address, 0, nil, &propertySize)

    let deviceCount = Int(propertySize) / MemoryLayout<AudioDeviceID>.size
    var deviceIDs = [AudioDeviceID](repeating: 0, count: deviceCount)
    let status = AudioObjectGetPropertyData(AudioObjectID(kAudioObjectSystemObject), &address, 0, nil, &propertySize, &deviceIDs)

    if status == noErr {
        // Match each device's UID against the AVCaptureDevice's uniqueID.
        for id in deviceIDs {
            var uidSize: UInt32 = 0
            var uidAddress = AudioObjectPropertyAddress(
                mSelector: kAudioDevicePropertyDeviceUID,
                mScope: kAudioObjectPropertyScopeGlobal,
                mElement: kAudioObjectPropertyElementMain
            )
            AudioObjectGetPropertyDataSize(id, &uidAddress, 0, nil, &uidSize)

            var deviceUID: Unmanaged<CFString>?
            var uidPropertySize = UInt32(MemoryLayout.size(ofValue: deviceUID))
            let uidStatus = AudioObjectGetPropertyData(id, &uidAddress, 0, nil, &uidPropertySize, &deviceUID)

            if uidStatus == noErr, let deviceUID = deviceUID?.takeUnretainedValue() as String? {
                if captureDevice.uniqueID == deviceUID {
                    return id
                }
            } else {
                logger.error("Failed to get device UID with error: \(uidStatus)")
            }
        }
    } else {
        logger.error("Failed to get device IDs with error: \(status)")
    }

    return nil
}
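And a small sketch of how the lookup could be wired up on macOS; it assumes getAudioDeviceID is exposed as a static on AudioProcessor, so adjust to wherever it actually lives:
import AVFoundation

// Sketch: map the chosen AVCaptureDevice to the Core Audio device ID used above.
if let selectedDevice = AVCaptureDevice.default(for: .audio),
   let coreAudioID = AudioProcessor.getAudioDeviceID(for: selectedDevice) {
    // coreAudioID can then be handed to startRecordingLive(inputDeviceID:) above.
    print("Using \(selectedDevice.localizedName) (AudioDeviceID \(coreAudioID))")
}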
Previously, when I was working with SwiftWhisper, I could translate the format from CMSampleBuffers using the float conversion method here. That implementation was not great, but it was good enough for demos. I'm curious if there's some conversion here that might be doing something similar?
https://gist.github.com/cgfarmer4/182d9d6d1cdf9d219ba0a4db6a23d745#file-capturedelegate-swift-L1-L46
https://gist.github.com/cgfarmer4/182d9d6d1cdf9d219ba0a4db6a23d745#file-audiosessionmanager-swift-L88-L111
I would be interested in having this integrate with AVCaptureSession too. Given the ability to use UVC capture devices in iPadOS 17, which are accessed via AVCaptureSession, it would be handy to pass a CMSampleBuffer directly into WhisperKit. That allows audio sources from any HDMI video source, cameras, game consoles, etc.
I'll need to do some testing; I already have a pipeline set up for video and audio processing from CMSampleBuffer. I will explore using the code snippets linked by @cgfarmer4 to convert the CMSampleBuffer to a [Float] and then pass that into WhisperKit to see if I can get it working.
Currently, the CMSampleBuffer stream operates at its own cadence in terms of sample rate, as it's essentially just used to display audio meters based on averagePowerLevel. It looks like I'll need to force the sample rate to 16000 Hz and only pass in audio from channel 1 if the source is stereo.
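In case it helps, here's a rough sketch of that conversion (my own, not WhisperKit API) using AVAudioConverter to get 16 kHz mono Float32 samples out of an LPCM CMSampleBuffer; the resulting [Float] could then be appended the same way the existing tap callback does:
import AVFoundation
import CoreMedia

// Sketch: convert an uncompressed PCM audio CMSampleBuffer (e.g. from
// AVCaptureAudioDataOutput) to 16 kHz mono Float32 samples.
func floatSamples16kMono(from sampleBuffer: CMSampleBuffer) -> [Float] {
    guard let formatDescription = CMSampleBufferGetFormatDescription(sampleBuffer) else { return [] }
    let sourceFormat = AVAudioFormat(cmAudioFormatDescription: formatDescription)
    let frameCount = AVAudioFrameCount(CMSampleBufferGetNumSamples(sampleBuffer))

    // Copy the sample buffer's PCM data into an AVAudioPCMBuffer in the source format.
    guard let inputBuffer = AVAudioPCMBuffer(pcmFormat: sourceFormat, frameCapacity: frameCount) else { return [] }
    inputBuffer.frameLength = frameCount
    guard CMSampleBufferCopyPCMDataIntoAudioBufferList(
        sampleBuffer,
        at: 0,
        frameCount: Int32(frameCount),
        into: inputBuffer.mutableAudioBufferList
    ) == noErr else { return [] }

    // Target format: 16 kHz, single channel, Float32 (what the Whisper models expect).
    // The converter downmixes stereo to mono; alternatively take just channel 1.
    guard let targetFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 16000, channels: 1, interleaved: false),
          let converter = AVAudioConverter(from: sourceFormat, to: targetFormat) else { return [] }

    let ratio = targetFormat.sampleRate / sourceFormat.sampleRate
    let outputCapacity = AVAudioFrameCount((Double(frameCount) * ratio).rounded(.up)) + 1
    guard let outputBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: outputCapacity) else { return [] }

    // Feed the single input buffer once, then signal end of stream.
    var conversionError: NSError?
    var consumed = false
    let outputStatus = converter.convert(to: outputBuffer, error: &conversionError) { _, inputStatus in
        if consumed {
            inputStatus.pointee = .endOfStream
            return nil
        }
        consumed = true
        inputStatus.pointee = .haveData
        return inputBuffer
    }
    guard outputStatus != .error, conversionError == nil, let channelData = outputBuffer.floatChannelData else { return [] }

    return Array(UnsafeBufferPointer(start: channelData[0], count: Int(outputBuffer.frameLength)))
}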
Thanks!