
Revisit Log-Mel spectrogram computation

Open ggerganov opened this issue 1 year ago • 6 comments

Last time I checked, the results produced by whisper.cpp for computing the Log-Mel spectrogram were not exactly identical to the OpenAI implementation:

  • whisper.cpp: https://github.com/ggerganov/whisper.cpp/blob/master/whisper.cpp#L2284-L2298

  • OpenAI Whisper: https://github.com/openai/whisper/blob/main/whisper/audio.py#L92-L124

I think the spectrograms produced by the two methods should be pretty close to each other, because the transcription obviously works correctly. Nevertheless, it would be useful to compare the spectrograms in more detail and see if we can make the C++ code match the Python code more closely. Eliminating any differences in the audio input would make it easier to compare transcription results between the two codebases.

This should be a good exercise for anyone looking to start contributing to the project, so feel free to open a PR or discuss your findings!

ggerganov avatar Mar 05 '23 20:03 ggerganov
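For anyone picking this up, here is a minimal NumPy sketch of the pipeline the two implementations are supposed to agree on: a Hann-windowed STFT, a mel filterbank projection, log10 compression with clamping, and the final normalization. Note the filterbank here is a simple HTK-style triangular construction for illustration; OpenAI actually loads precomputed slaney-style filters from `mel_filters.npz` (and drops the last STFT frame), so this will not numerically match either codebase, it only shows the structure being compared.

```python
import numpy as np

N_FFT = 400   # Whisper's FFT size (25 ms at 16 kHz)
HOP = 160     # hop length (10 ms at 16 kHz)
N_MEL = 80    # number of mel bins

def mel_filterbank(sr=16000, n_fft=N_FFT, n_mels=N_MEL):
    # HTK-style triangular filters, evenly spaced on the mel scale.
    # This is an approximation of the slaney filters OpenAI ships.
    fmax_mel = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    mel_pts = np.linspace(0, fmax_mel, n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):        # rising edge of the triangle
            fb[m - 1, k] = (k - l) / (c - l)
        for k in range(c, r):        # falling edge
            fb[m - 1, k] = (r - k) / (r - c)
    return fb

def log_mel_spectrogram(audio, sr=16000):
    # 1) STFT with a periodic Hann window
    window = np.hanning(N_FFT + 1)[:-1]
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 2) project onto the mel filterbank (a matrix multiply)
    mel = mel_filterbank(sr) @ power.T
    # 3) log compression with clamping, as in OpenAI's audio.py
    log_spec = np.log10(np.maximum(mel, 1e-10))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    # 4) normalize into roughly [-1, 1]
    return (log_spec + 4.0) / 4.0

audio = np.random.default_rng(0).standard_normal(16000).astype(np.float32)
mel = log_mel_spectrogram(audio)
print(mel.shape)  # (80, 98) for 1 s of 16 kHz audio
```

A practical way to use this for the comparison is to feed the same WAV file through `whisper.cpp`, OpenAI's `log_mel_spectrogram`, and a reference like the above, then diff the resulting matrices element-wise.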

Out of curiosity, have you considered rewriting the mel spectrogram part on iOS/macOS with the Accelerate framework? Here is a good example: https://developer.apple.com/documentation/accelerate/computing_the_mel_spectrum_using_linear_algebra#overview

bexp avatar Mar 06 '23 17:03 bexp

@bexp Great suggestion! It would be nice to have this implemented - it would likely be faster than the existing method.

ggerganov avatar Mar 06 '23 17:03 ggerganov

I've benchmarked all of the modern FFTs; you probably want pocketfft for this. I'm looking at porting to that now - a naive port brings the mel time from around 10 ms down to 1 ms for me.

lunixbochs avatar Mar 08 '23 04:03 lunixbochs
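As a side note for anyone verifying a pocketfft port from Python: NumPy's FFT module has itself been backed by a C++ port of pocketfft since NumPy 1.17, so `np.fft.rfft` makes a convenient reference. The sketch below checks a real-input FFT of one Whisper-sized frame against a naive O(n²) DFT; the real-input variant only produces `n//2 + 1` bins, which is the main saving over a generic complex FFT for audio spectrograms.

```python
import numpy as np

# One 25 ms frame at 16 kHz, i.e. Whisper's N_FFT = 400 samples.
rng = np.random.default_rng(1)
frame = rng.standard_normal(400)

# Real-input FFT (pocketfft under the hood in modern NumPy):
# returns only the n//2 + 1 non-negative frequency bins.
fast = np.fft.rfft(frame)

# Naive O(n^2) DFT over the same non-negative frequencies.
n = len(frame)
k = np.arange(n // 2 + 1)[:, None]
t = np.arange(n)[None, :]
naive = (frame * np.exp(-2j * np.pi * k * t / n)).sum(axis=1)

print(fast.shape)                 # (201,)
print(np.allclose(fast, naive))   # True
```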

PR for pocketfft here: #583

lunixbochs avatar Mar 08 '23 05:03 lunixbochs

@lunixbochs I think it would make sense to use pocketfft as the general-purpose implementation. For Apple platforms I'd still stick with Accelerate-based code. If you look at the example I posted, the FFT is just one of several steps involved in mel generation. Anyway, I hope we'll see benchmarks at some point.

bexp avatar Mar 08 '23 06:03 bexp

After my pocketfft PR, something like 75% of the log_mel computation is spent doing a matrix vector multiply here: https://github.com/ggerganov/whisper.cpp/blob/d1f16463fa8182d9436aa30287ad320492943f56/whisper.cpp#L2285-L2294

You could use Accelerate for that, but I assume ggml could do it just fine as well. My pocketfft PR also brings things to the point where log_mel accounts for only around 2% of the single-threaded runtime of whisper.cpp for me, even with whisper-tiny, and under 1% with 8 threads.

Edit: I applied a simple optimization to the matmul and it's much faster now; we're closer to 1% of the inference time spent in log_mel.

lunixbochs avatar Mar 08 '23 06:03 lunixbochs
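To make the optimization opportunity above concrete: the per-mel-bin dot-product loop in the linked `whisper.cpp` code is mathematically a single filterbank-matrix-times-power-vector product, which is exactly the shape of work a BLAS `sgemv` (via Accelerate) or a ggml mat-mul would dispatch efficiently. This is a hedged sketch of that equivalence in NumPy with made-up data, not whisper.cpp's actual code:

```python
import numpy as np

rng = np.random.default_rng(2)
n_mel, n_fft_bins = 80, 201                       # Whisper's mel/FFT sizes
filters = rng.random((n_mel, n_fft_bins)).astype(np.float32)
power = rng.random(n_fft_bins).astype(np.float32)  # one frame of |FFT|^2

# The scalar loop: one dot product per mel bin, as in the C++ code.
loop_out = np.empty(n_mel, dtype=np.float32)
for j in range(n_mel):
    acc = 0.0
    for k in range(n_fft_bins):
        acc += filters[j, k] * power[k]
    loop_out[j] = acc

# The same computation as a single matrix-vector product, which is
# what a BLAS gemv (Accelerate) or a ggml mat-mul would perform.
gemv_out = filters @ power

print(np.allclose(loop_out, gemv_out, rtol=1e-4))  # True
```

The matmul form also extends naturally to batching all frames at once (`filters @ power_matrix`), which is typically where the bigger speedup comes from.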