Audio Corruption - WAV file header parsing assumes fixed 44-byte offset

Open ebrindley opened this issue 1 month ago • 1 comments

Labels: bug, crash, audio-pipeline, high-priority

Description

The Local and Parakeet transcription services assume that every WAV file has exactly a 44-byte header and read samples from a hardcoded offset (stride(from: 44, to: data.count, by: 2)). In practice the offset of the audio data varies with metadata, extra chunks, and format extensions, so non-standard WAV files cause crashes or incorrect transcriptions.

User Impact

App crashes when processing WAV files with metadata
Garbled transcriptions from reading header bytes as audio data
Transcription failures with files from professional audio software
Inconsistent results depending on which app created the WAV file

Technical Details

Affected Files:

VoiceInk/Services/LocalTranscriptionService.swift - Line 87
VoiceInk/Services/ParakeetTranscriptionService.swift - Line 86

Current implementation:


let floats = stride(from: 44, to: data.count, by: 2).map {
    return data[$0..<$0 + 2].withUnsafeBytes {
        let short = Int16(littleEndian: $0.load(as: Int16.self))
        return max(-1.0, min(Float(short) / 32767.0, 1.0))
    }
}

Problems:

WAV header size varies (typically 44, but can be 58, 78, or larger)
No parsing of WAV header chunks
No validation of audio format, sample rate, or channels
Assumes 16-bit PCM (may be 8-bit, 24-bit, 32-bit, or float)
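
To illustrate the last point, here is a minimal sketch of how sample bytes could be converted once the fmt chunk has been read. The function name decodeSamples, its parameters, and the error type are hypothetical and not part of VoiceInk. It assumes the caller has already extracted the format tag (1 for integer PCM, 3 for IEEE float) and the bits-per-sample field, and it does not handle WAVE_FORMAT_EXTENSIBLE files.

```swift
import Foundation

// Hypothetical error type for this sketch.
enum SampleDecodeError: Error {
    case unsupported(formatTag: UInt16, bitsPerSample: UInt16)
}

/// Converts little-endian WAV sample bytes to floats in roughly [-1, 1].
/// Integer samples are scaled by a power of two (a choice made for this sketch).
func decodeSamples(_ bytes: [UInt8], formatTag: UInt16, bitsPerSample: UInt16) throws -> [Float] {
    switch (formatTag, bitsPerSample) {
    case (1, 8):
        // 8-bit WAV PCM is unsigned and centered on 128.
        return bytes.map { (Float($0) - 128) / 128 }
    case (1, 16):
        return stride(from: 0, to: bytes.count - 1, by: 2).map { i -> Float in
            let raw = UInt16(bytes[i]) | (UInt16(bytes[i + 1]) << 8)
            return Float(Int16(bitPattern: raw)) / 32_768
        }
    case (1, 24):
        return stride(from: 0, to: bytes.count - 2, by: 3).map { i -> Float in
            // Load the three bytes into the top of a 32-bit word so the sign bit carries over.
            let b0 = UInt32(bytes[i]) << 8
            let b1 = UInt32(bytes[i + 1]) << 16
            let b2 = UInt32(bytes[i + 2]) << 24
            return Float(Int32(bitPattern: b0 | b1 | b2)) / 2_147_483_648
        }
    case (3, 32):
        // 32-bit IEEE float samples are stored as-is.
        return stride(from: 0, to: bytes.count - 3, by: 4).map { i -> Float in
            let b0 = UInt32(bytes[i])
            let b1 = UInt32(bytes[i + 1]) << 8
            let b2 = UInt32(bytes[i + 2]) << 16
            let b3 = UInt32(bytes[i + 3]) << 24
            return Float(bitPattern: b0 | b1 | b2 | b3)
        }
    default:
        throw SampleDecodeError.unsupported(formatTag: formatTag, bitsPerSample: bitsPerSample)
    }
}
```

Throwing on unknown formats, rather than falling back to 16-bit, turns a silently garbled transcription into an explicit error.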

WAV files that will fail:

Files with LIST/INFO chunks (metadata)
Broadcast WAV (BWF) files with bext chunks
Files from Logic Pro, Pro Tools, Ableton (often have extended headers)
Files converted from MP3/M4A (may have metadata embedded)
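
The effect of a metadata chunk on the sample offset can be shown with a small in-memory WAV builder. This is a sketch only: appendLE, riffChunk, makeWAV, readLE32, and sampleOffset are hypothetical helpers written for this example, and the file it builds is a minimal 16-bit mono 16 kHz WAV rather than the output of any particular application.

```swift
import Foundation

/// Appends a fixed-width integer in little-endian byte order.
func appendLE<T: FixedWidthInteger>(_ value: T, to data: inout Data) {
    withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
}

/// Wraps a payload as a RIFF chunk: 4-byte ID, little-endian size, payload,
/// plus one pad byte when the payload size is odd.
func riffChunk(_ id: String, _ payload: Data) -> Data {
    var chunk = Data(id.utf8)
    appendLE(UInt32(payload.count), to: &chunk)
    chunk.append(payload)
    if payload.count % 2 == 1 { chunk.append(UInt8(0)) }
    return chunk
}

/// Builds a 16-bit mono 16 kHz WAV, optionally with a LIST/INFO comment before the data chunk.
func makeWAV(samples: [Int16], infoComment: String? = nil) -> Data {
    var fmt = Data()
    appendLE(UInt16(1), to: &fmt)        // format tag: integer PCM
    appendLE(UInt16(1), to: &fmt)        // channels
    appendLE(UInt32(16_000), to: &fmt)   // sample rate
    appendLE(UInt32(32_000), to: &fmt)   // byte rate
    appendLE(UInt16(2), to: &fmt)        // block align
    appendLE(UInt16(16), to: &fmt)       // bits per sample

    var pcm = Data()
    for sample in samples { appendLE(sample, to: &pcm) }

    var body = Data("WAVE".utf8)
    body.append(riffChunk("fmt ", fmt))
    if let comment = infoComment {
        var list = Data("INFO".utf8)
        list.append(riffChunk("ICMT", Data(comment.utf8)))
        body.append(riffChunk("LIST", list))
    }
    body.append(riffChunk("data", pcm))
    return riffChunk("RIFF", body)
}

/// Reads a little-endian UInt32 byte by byte, so alignment does not matter.
func readLE32(_ data: Data, at index: Int) -> UInt32 {
    let b0 = UInt32(data[index])
    let b1 = UInt32(data[index + 1]) << 8
    let b2 = UInt32(data[index + 2]) << 16
    let b3 = UInt32(data[index + 3]) << 24
    return b0 | b1 | b2 | b3
}

/// Walks the chunk list and returns the byte offset of the first audio sample.
func sampleOffset(in wav: Data) -> Int? {
    var offset = 12
    while offset + 8 <= wav.count {
        let id = String(decoding: wav[offset..<offset + 4], as: UTF8.self)
        let size = Int(readLE32(wav, at: offset + 4))
        if id == "data" { return offset + 8 }
        offset += 8 + size + (size & 1)
    }
    return nil
}

let plain = makeWAV(samples: [1, 2, 3])
let tagged = makeWAV(samples: [1, 2, 3], infoComment: "test")
print(sampleOffset(in: plain) ?? -1)    // 44
print(sampleOffset(in: tagged) ?? -1)   // 68
```

In the tagged file a fixed 44-byte offset would interpret the 24 bytes of the LIST chunk as 12 audio samples before reaching any real audio.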

Reproduction

Create WAV file with metadata in Audacity or Logic Pro
Try to transcribe using local model
Observe crash or garbled transcription

Recommended Fix

Replace the readAudioSamples method in both LocalTranscriptionService.swift and ParakeetTranscriptionService.swift:


private func readAudioSamples(_ url: URL) throws -> [Float] {
    let data = try Data(contentsOf: url)

    guard data.count >= 44 else {
        throw NSError(domain: "com.prakashjoshipax.voiceink", code: -1,
            userInfo: [NSLocalizedDescriptionKey: "File too small to be valid WAV"])
    }

    // Verify RIFF header
    guard data[0..<4].elementsEqual("RIFF".utf8),
          data[8..<12].elementsEqual("WAVE".utf8) else {
        throw NSError(domain: "com.prakashjoshipax.voiceink", code: -1,
            userInfo: [NSLocalizedDescriptionKey: "Not a valid WAV file"])
    }

    // Find 'data' chunk by walking the RIFF chunk list
    var offset = 12
    var dataOffset = 0
    var dataSize = 0

    while offset + 8 <= data.count {
        let chunkID = String(data: data[offset..<offset + 4], encoding: .ascii) ?? ""
        // Chunk sizes are little-endian and only 2-byte aligned, so use an
        // unaligned load (Swift 5.7+); a plain load(as:) can trap here.
        let chunkSize = Int(data.withUnsafeBytes {
            UInt32(littleEndian: $0.loadUnaligned(fromByteOffset: offset + 4, as: UInt32.self))
        })

        if chunkID == "data" {
            dataOffset = offset + 8
            dataSize = chunkSize
            break
        }

        // Chunks with an odd size are followed by one pad byte
        offset += 8 + chunkSize + (chunkSize & 1)
    }

    guard dataOffset > 0 else {
        throw NSError(domain: "com.prakashjoshipax.voiceink", code: -1,
            userInfo: [NSLocalizedDescriptionKey: "No audio data chunk found in WAV"])
    }

    // Parse audio samples from data chunk.
    // Note: this still assumes 16-bit PCM; the fmt chunk is not inspected.
    let end = min(dataOffset + dataSize, data.count)
    let floats = stride(from: dataOffset, to: end - 1, by: 2).map { index -> Float in
        let short = data.withUnsafeBytes {
            Int16(littleEndian: $0.loadUnaligned(fromByteOffset: index, as: Int16.self))
        }
        return max(-1.0, min(Float(short) / 32767.0, 1.0))
    }

    return floats
}

Testing

Test with normal VoiceInk-generated WAV files (should work identically)
Test with WAV files containing metadata (Audacity, Logic Pro exports)
Test with files from various professional audio applications
Verify appropriate error messages for truly invalid files

Nov 12 '25 18:11 ebrindley

I've been experiencing issues with transcoding all audio recently, so I've been looking into a fix for this that takes a slightly different approach than the manual WAV parsing snippet proposed in the issue.

Instead of keeping per‑service readAudioSamples helpers that assume a 44‑byte header, both LocalTranscriptionService and ParakeetTranscriptionService now delegate all audio decoding to a shared AudioProcessor:

LocalTranscriptionService/ParakeetTranscriptionService now call try await audioProcessor.processAudioToSamples(audioURL) and their old readAudioSamples implementations have been removed.
AudioProcessor uses AVAudioFile + AVAudioConverter to read the file and convert it into a normalized mono Float32 stream at 16 kHz, which matches what Whisper expects.
Because AVFoundation handles RIFF/WAV parsing internally, this automatically supports variable‑length headers, metadata/BWF chunks, and different PCM encodings instead of relying on a hard‑coded stride(from: 44, …) offset.
The processor also downmixes multi‑channel audio, performs resampling when the input sample rate isn’t 16 kHz, and normalizes sample amplitudes in one place, so both local and Parakeet paths share the exact same, well‑tested pipeline.
Centralizing this logic in AudioProcessor means any future fixes or format support improvements happen in one place and benefit all transcription services.
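
For readers who want a concrete picture of the AVAudioFile + AVAudioConverter path described above, here is a rough sketch. It is not the actual AudioProcessor code: decodeToMono16kFloat and AudioDecodeError are hypothetical names, the whole file is read into memory at once (an assumption that only suits short recordings), and channel downmixing is left to the converter's default behavior.

```swift
import AVFoundation

// Hypothetical error type for this sketch.
enum AudioDecodeError: Error {
    case cannotCreateConverter
    case cannotAllocateBuffer
    case conversionFailed(Error?)
}

/// Decodes an audio file to mono Float32 samples at 16 kHz.
func decodeToMono16kFloat(_ url: URL) throws -> [Float] {
    // AVAudioFile parses the container, so no header offset is assumed here.
    let file = try AVAudioFile(forReading: url)
    let sourceFormat = file.processingFormat

    guard let targetFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                           sampleRate: 16_000,
                                           channels: 1,
                                           interleaved: false),
          let converter = AVAudioConverter(from: sourceFormat, to: targetFormat) else {
        throw AudioDecodeError.cannotCreateConverter
    }

    // Read the entire file into one buffer (assumption: the recording is short).
    guard let input = AVAudioPCMBuffer(pcmFormat: sourceFormat,
                                       frameCapacity: AVAudioFrameCount(file.length)) else {
        throw AudioDecodeError.cannotAllocateBuffer
    }
    try file.read(into: input)

    // Size the output for the resampled length, with some headroom.
    let ratio = targetFormat.sampleRate / sourceFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(input.frameLength) * ratio) + 1_024
    guard let output = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity) else {
        throw AudioDecodeError.cannotAllocateBuffer
    }

    // Hand the converter the input buffer once, then signal end of stream.
    var inputConsumed = false
    var conversionError: NSError?
    let status = converter.convert(to: output, error: &conversionError) { _, inputStatus in
        if inputConsumed {
            inputStatus.pointee = .endOfStream
            return nil
        }
        inputConsumed = true
        inputStatus.pointee = .haveData
        return input
    }

    guard status != .error, let channels = output.floatChannelData else {
        throw AudioDecodeError.conversionFailed(conversionError)
    }
    return Array(UnsafeBufferPointer(start: channels[0], count: Int(output.frameLength)))
}
```

A production version would likely stream the file in blocks instead of allocating one buffer for its full length.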

Practically, this addresses the same root problem described in the issue (incorrect fixed header assumption leading to corrupted audio and crashes), but does so by leaning on AVFoundation’s WAV parsing and conversion instead of maintaining our own RIFF parser in multiple services.

Perhaps you could consider this approach? Though I might be missing something critical here.

Nov 14 '25 02:11 Unlearn