whisper.cpp
perf regression compared with v1.4.0
I used release 1.4.0 and release 1.5.4 to process a 2.5-minute audio file. Release 1.5.4 is 11 s slower than release 1.4.0, and the latest whisper.cpp code runs slower still. Is this expected? Is there any way to make the latest whisper.cpp run faster while keeping the transcription quality?
I ran the command below on Windows 11:
.\main.exe -m ggml-small.bin -oj -ml 1 -sow -f input.wav
Release 1.5.4:
whisper_print_timings: load time = 467.40 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 252.86 ms
whisper_print_timings: sample time = 4312.42 ms / 3060 runs ( 1.41 ms per run)
whisper_print_timings: encode time = 36698.69 ms / 6 runs ( 6116.45 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: batchd time = 29657.34 ms / 3033 runs ( 9.78 ms per run)
whisper_print_timings: prompt time = 4671.10 ms / 947 runs ( 4.93 ms per run)
whisper_print_timings: total time = 76409.26 ms
Release 1.4.0:
whisper_print_timings: load time = 528.86 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 1114.53 ms
whisper_print_timings: sample time = 479.16 ms / 569 runs ( 0.84 ms per run)
whisper_print_timings: encode time = 42009.84 ms / 6 runs ( 7001.64 ms per run)
whisper_print_timings: decode time = 20988.81 ms / 569 runs ( 36.89 ms per run)
whisper_print_timings: total time = 65414.71 ms
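To see which stages account for the difference, the two reports can be compared stage by stage. The following is only a rough sketch: the helper `parse_timings` and the names `V154`, `V140`, and `deltas` are made up for illustration, and the numbers are copied from the two reports in this post with the per-run details left out.

```python
import re

# Matches "<stage> time = <ms> ms" entries in whisper_print_timings output.
TIMING_RE = re.compile(r"(\w+)\s+time\s*=\s*([\d.]+)\s*ms")


def parse_timings(report: str) -> dict:
    """Return {stage: milliseconds} for every 'X time = N ms' entry (hypothetical helper)."""
    return {stage: float(ms) for stage, ms in TIMING_RE.findall(report)}


# Figures copied from the release 1.5.4 report.
V154 = """
load time = 467.40 ms
mel time = 252.86 ms
sample time = 4312.42 ms
encode time = 36698.69 ms
decode time = 0.00 ms
batchd time = 29657.34 ms
prompt time = 4671.10 ms
total time = 76409.26 ms
"""

# Figures copied from the release 1.4.0 report.
V140 = """
load time = 528.86 ms
mel time = 1114.53 ms
sample time = 479.16 ms
encode time = 42009.84 ms
decode time = 20988.81 ms
total time = 65414.71 ms
"""

new, old = parse_timings(V154), parse_timings(V140)

# A stage missing from one report is treated as 0 ms there.
deltas = {
    stage: new.get(stage, 0.0) - old.get(stage, 0.0)
    for stage in sorted(set(new) | set(old))
}

for stage, delta in deltas.items():
    print(f"{stage:>7}: {delta:+10.2f} ms")
```

A positive number means 1.5.4 spent more time in that stage. With these figures the total differs by about 11 s: encode and mel are faster in 1.5.4, while sample and the decoding-related stages (decode, batchd, and prompt taken together) account for the increase.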
Use -bs 1 to get the old speed. Quality should generally be better with more beams, but you may not observe much of a difference.
Thanks @ggerganov, setting -bs 1 did make it run faster. But I found that word timing has regressed, especially for the last word of each sentence. The end times of these words always fall on a whole second, with the milliseconds all zero, as for the words "sir?", "in.", and "morning." in the result below. Can this be solved?
[00:00:13.000 --> 00:00:13.230] May
[00:00:13.230 --> 00:00:13.300] I
[00:00:13.300 --> 00:00:13.600] come
[00:00:13.600 --> 00:00:13.750] in
[00:00:13.750 --> 00:00:14.000] sir?
[00:00:14.000 --> 00:00:15.110] Yeah,
[00:00:15.110 --> 00:00:15.460] please
[00:00:15.460 --> 00:00:15.690] come
[00:00:15.690 --> 00:00:16.000] in.
[00:00:16.000 --> 00:00:16.570] Good
[00:00:16.570 --> 00:00:17.560] morning
[00:00:17.560 --> 00:00:18.000] sir.
[00:00:18.000 --> 00:00:18.280] Good
[00:00:18.280 --> 00:00:19.000] morning.
[00:00:19.000 --> 00:00:19.220] Please
[00:00:19.220 --> 00:00:19.360] take
[00:00:19.360 --> 00:00:19.490] your
[00:00:19.490 --> 00:00:19.710] seat,
[00:00:19.710 --> 00:00:20.000] jeeveni.
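To check how widespread the rounding is, a short script can list every word whose end time has zero milliseconds. This is only a sketch: the function `whole_second_ends` and the name `SAMPLE` are made up for illustration, the input is the first few entries of the output above, and it assumes the `[start --> end] word` layout shown there. A word can legitimately end on a whole second, so a flagged entry is only a hint.

```python
import re

# One "[HH:MM:SS.mmm --> HH:MM:SS.mmm] word" entry from the console output.
ENTRY_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\.(\d{3}) --> (\d{2}:\d{2}:\d{2})\.(\d{3})\]\s+(\S+)"
)


def whole_second_ends(transcript: str) -> list:
    """Return (word, end_time) pairs whose end time has .000 milliseconds (hypothetical helper)."""
    flagged = []
    for _start, _start_ms, end, end_ms, word in ENTRY_RE.findall(transcript):
        if end_ms == "000":
            flagged.append((word, f"{end}.{end_ms}"))
    return flagged


# First nine entries copied from the output above.
SAMPLE = """
[00:00:13.000 --> 00:00:13.230] May
[00:00:13.230 --> 00:00:13.300] I
[00:00:13.300 --> 00:00:13.600] come
[00:00:13.600 --> 00:00:13.750] in
[00:00:13.750 --> 00:00:14.000] sir?
[00:00:14.000 --> 00:00:15.110] Yeah,
[00:00:15.110 --> 00:00:15.460] please
[00:00:15.460 --> 00:00:15.690] come
[00:00:15.690 --> 00:00:16.000] in.
"""

for word, end in whole_second_ends(SAMPLE):
    print(f"{word:<8} ends at {end}")
```

On this sample the flagged words are "sir?" and "in.", both of which are the last word of a sentence.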