whisper.cpp

Speedup option with variable rate between 1x and 2x

Open Topping1 opened this issue 1 year ago • 3 comments

@ggerganov I was wondering if the speedup option could be made customizable, say between 1x and 2x, because some audio has a sweet spot around 1.5x that balances speed and intelligibility.

Topping1 • Jan 09 '23 22:01

When I implemented the x2 speed-up option I did some research on tempo speed-up algorithms, and it looks like the general solution is not trivial to implement, because you have to preserve the pitch in order to get good results from Whisper. So I thought it best to use a third-party library or tool (e.g. ffmpeg) to pre-process the audio and then feed it to Whisper.

In the future, we can add support for that in whisper.cpp, but it has to be a lightweight solution. Hopefully someone contributes one, but for now I consider this a low-priority issue.
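The ffmpeg pre-processing mentioned above could look roughly like the sketch below. It relies on ffmpeg's `atempo` audio filter, which changes tempo without changing pitch, and converts the result to the 16 kHz mono 16-bit WAV that whisper.cpp expects. The helper names `build_atempo_command` and `speed_up` are made up for illustration, the 1x to 2x limit simply mirrors the range requested in this issue, and ffmpeg is assumed to be on the PATH.

```python
import subprocess


def build_atempo_command(src, dst, speed):
    """Build an ffmpeg command that speeds up audio while preserving pitch.

    Hypothetical helper: the 1.0-2.0 limit is the range asked for in this
    issue, not a limit of ffmpeg itself.
    """
    if not 1.0 <= speed <= 2.0:
        raise ValueError("speed must be between 1.0 and 2.0")
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-filter:a", f"atempo={speed}",  # tempo change, pitch preserved
        "-ar", "16000",                  # 16 kHz sample rate
        "-ac", "1",                      # mono
        "-c:a", "pcm_s16le",             # 16-bit PCM WAV
        dst,
    ]


def speed_up(src, dst, speed):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_atempo_command(src, dst, speed), check=True)
```

For example, `speed_up("talk.wav", "talk_1.5x.wav", 1.5)` would write a 1.5x version that can then be passed to whisper.cpp as usual.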

ggerganov • Jan 10 '23 21:01

You are right. I did some reading, and even the most barebones implementations rely on external libraries for FFT and other functions. I ended up adding the variable speedup option to the GUI I implemented, using ffmpeg (with timestamp correction for subtitle files).
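The timestamp correction is needed because Whisper reports times relative to the sped-up audio, so each timestamp has to be multiplied by the speed factor to line up with the original recording. The GUI's actual code is not shown in this thread; the following is only a minimal sketch for SRT files, with `rescale_srt` as a hypothetical name.

```python
import re

# Matches SRT timestamps such as 00:01:23,456
_TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def _scale(match, speed):
    """Multiply one HH:MM:SS,mmm timestamp by the speed factor."""
    h, m, s, ms = (int(g) for g in match.groups())
    total_ms = round((((h * 60 + m) * 60 + s) * 1000 + ms) * speed)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def rescale_srt(text, speed):
    """Map subtitle times from sped-up audio back to the original timeline.

    Only timing lines (those containing "-->") are touched, so digits
    inside the subtitle text itself are left alone.
    """
    out = []
    for line in text.splitlines(keepends=True):
        if "-->" in line:
            line = _TIMESTAMP.sub(lambda mt: _scale(mt, speed), line)
        out.append(line)
    return "".join(out)
```

With a 1.5x speedup, a cue transcribed at `00:00:10,000 --> 00:00:12,000` maps back to `00:00:15,000 --> 00:00:18,000` in the original audio.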

Topping1 • Jan 17 '23 04:01

@Topping1 A couple of years ago I needed to resample realtime audio while preserving pitch. I also did some research and quickly found out that it is very complicated to implement. Then I found the libsoxr library, and the results were pretty good.

As for quality, here are the tests: libsoxr (the combobox item is “ffmpeg 4.2.2 soxr”) is pretty close to professional software like Steinberg’s Cubase 10 and Nuendo 11. As for performance, the target device had a 1 GHz Allwinner ARM64 CPU, and my program used 1% of CPU time resampling realtime stereo 44.1 kHz → 48 kHz.

Many Linux distributions even ship the library in their package repositories, e.g. Alpine.
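For anyone who wants to try libsoxr without linking against it directly, ffmpeg can use it as its resampling engine through the `aresample` filter, provided the ffmpeg build was configured with libsoxr support. The sketch below only covers sample-rate conversion such as the 44.1 kHz → 48 kHz case above, not tempo change, and `build_soxr_resample_command` is a hypothetical helper name.

```python
import subprocess


def build_soxr_resample_command(src, dst, out_rate=48000):
    """Build an ffmpeg command that resamples audio using libsoxr.

    Assumes an ffmpeg build with libsoxr enabled; otherwise ffmpeg
    will reject the resampler=soxr option at run time.
    """
    if out_rate <= 0:
        raise ValueError("out_rate must be positive")
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-af", "aresample=resampler=soxr",  # select the SoX resampler
        "-ar", str(out_rate),               # target sample rate in Hz
        dst,
    ]


def resample(src, dst, out_rate=48000):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_soxr_resample_command(src, dst, out_rate), check=True)
```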

Const-me • Jan 23 '23 12:01