whisper.cpp
Revisit Log-Mel spectrogram computation
Last time I checked, the results produced by whisper.cpp
for computing the Log-Mel spectrogram were not exactly identical to the OpenAI implementation:
- whisper.cpp: https://github.com/ggerganov/whisper.cpp/blob/master/whisper.cpp#L2284-L2298
- OpenAI Whisper: https://github.com/openai/whisper/blob/main/whisper/audio.py#L92-L124
I think the spectrograms produced by the two methods should be quite close to each other, since transcription obviously works correctly. Nevertheless, it would be useful to compare the spectrograms in more detail and see if we can make the C++ code match the Python code more closely. Eliminating any differences in the audio input would make it easier to compare transcription results between the two codebases.
This should be a good exercise for anyone looking to start contributing to the project, so feel free to open a PR or discuss your findings!
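For anyone picking this up, the reference pipeline being compared can be sketched in NumPy as follows. This is a minimal sketch, not the actual whisper.cpp or OpenAI code: the constants mirror whisper/audio.py, but the triangular filterbank here is a simplified HTK-style construction (OpenAI ships precomputed Slaney-style mel filters, and uses a periodic Hann window rather than NumPy's symmetric one), so exact values will differ slightly — which is precisely the kind of discrepancy worth hunting down.

```python
import numpy as np

SAMPLE_RATE, N_FFT, HOP, N_MELS = 16000, 400, 160, 80

def mel_filterbank(sr, n_fft, n_mels):
    # Simplified HTK-style mel filterbank (assumption; OpenAI uses
    # precomputed Slaney filters): m = 2595 * log10(1 + f / 700)
    m_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    f_pts = 700 * (10 ** (m_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * f_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):  # rising edge of triangle i
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):  # falling edge of triangle i
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_spectrogram(audio):
    # Frame the signal, window it, and take the power spectrum per frame.
    window = np.hanning(N_FFT)  # symmetric; torch.hann_window is periodic
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP:i * HOP + N_FFT] for i in range(n_frames)])
    magnitudes = np.abs(np.fft.rfft(frames * window, axis=-1)) ** 2
    mel = mel_filterbank(SAMPLE_RATE, N_FFT, N_MELS) @ magnitudes.T
    # These three steps match whisper/audio.py: floor, clamp the dynamic
    # range to 8 log10-decades below the peak, then rescale.
    log_spec = np.log10(np.maximum(mel, 1e-10))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    return (log_spec + 4.0) / 4.0

t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (80, 98) for 1 s of 16 kHz audio
```

Dumping the C++ and Python spectrograms for the same input and diffing them element-wise against a sketch like this should quickly localize where the implementations diverge (windowing, filterbank, or the log/clamp steps).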
Out of curiosity, have you considered rewriting the mel spectrogram part on iOS/macOS with the Accelerate framework? Here is a good example: https://developer.apple.com/documentation/accelerate/computing_the_mel_spectrum_using_linear_algebra#overview
@bexp Great suggestion! Would be nice to have this implemented - it would likely be faster compared to the existing method.
I've benchmarked all of the modern ffts, you probably want pocketfft for this. Looking at porting to that now, a naive port brings the mel time from around 10ms to 1ms for me.
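As a rough point of reference, NumPy's FFT backend has itself been pocketfft since NumPy 1.17, so a quick micro-benchmark gives a feel for batched real FFTs at Whisper's frame size. The frame counts below are illustrative (N_FFT = 400, roughly 3000 frames per 30 s window); absolute timings are machine-dependent.

```python
import time
import numpy as np

N_FFT, N_FRAMES = 400, 3000
frames = np.random.default_rng(0).standard_normal((N_FRAMES, N_FFT))

start = time.perf_counter()
spectra = np.fft.rfft(frames, axis=-1)  # batched real-to-complex FFT (pocketfft)
elapsed = time.perf_counter() - start

# A real FFT of length N yields N // 2 + 1 complex bins.
print(spectra.shape, f"{elapsed * 1000:.2f} ms")
```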
PR for pocketfft here: #583
@lunixbochs I think it would make sense to use pocketfft as the general-purpose implementation. For Apple platforms I'd still stick with Accelerate-based code. If you look at the example I posted, FFT generation is just one of multiple steps involved in mel generation. Anyway, I hope we'll see benchmarks at some point.
After my pocketfft PR, something like 75% of the log_mel computation is spent doing a matrix vector multiply here: https://github.com/ggerganov/whisper.cpp/blob/d1f16463fa8182d9436aa30287ad320492943f56/whisper.cpp#L2285-L2294
You could use Accelerate for that, but I assume ggml could also handle it just fine. My pocketfft PR also puts this at the point where log_mel accounts for only around 2% of the single-threaded time of whisper.cpp for me, even with whisper-tiny, and <1% with 8 threads.
Edit: I applied a simple optimization to the matmul and it's much faster now, we're closer to 1% of the inference time in logmel.
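The hot spot described above is just applying the mel filterbank to each frame's power spectrum: an (n_mels × n_freq) matrix times an n_freq vector, once per frame. The sketch below (shapes are illustrative, not taken from the linked code) shows a naive scalar loop in the spirit of the original C++ and the equivalent single matrix multiply; the matmul form is typically much faster because it can hit an optimized BLAS and stays cache-friendly.

```python
import numpy as np

rng = np.random.default_rng(1)
filters = rng.random((80, 201))       # mel filterbank: n_mels x n_freq
magnitudes = rng.random((201, 3000))  # power spectra:  n_freq x n_frames

def naive_matvec(filters, magnitudes):
    # One dot product per (mel band, frame) pair, accumulated scalar by
    # scalar, mirroring the per-element loop in the original C++ code.
    n_mels, n_freq = filters.shape
    n_frames = magnitudes.shape[1]
    out = np.zeros((n_mels, n_frames))
    for j in range(n_frames):
        for m in range(n_mels):
            acc = 0.0
            for k in range(n_freq):
                acc += filters[m, k] * magnitudes[k, j]
            out[m, j] = acc
    return out

fast = filters @ magnitudes  # one BLAS-backed matmul covering all frames
# Check a small slice only; the pure-Python loop is slow over 3000 frames.
slow = naive_matvec(filters, magnitudes[:, :10])
print(np.allclose(fast[:, :10], slow))  # same result, very different speed
```

The same restructuring applies on the C++ side whether the multiply goes through ggml, Accelerate, or a hand-rolled loop with better memory order.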