
perf regression compared with v1.4.0

Open knitvoger opened this issue 3 months ago • 2 comments

I tried release 1.4.0 and release 1.5.4 to process a 2.5 min audio file. Release 1.5.4 is 11 s slower than release 1.4.0, and the latest whisper.cpp code runs even slower. Is this expected? Is there any way to make the latest whisper.cpp run faster while keeping the transcription quality?

I ran the command below on Windows 11:
.\main.exe -m ggml-small.bin -oj -ml 1 -sow -f input.wav

Release 1.5.4:
whisper_print_timings: load time = 467.40 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 252.86 ms
whisper_print_timings: sample time = 4312.42 ms / 3060 runs ( 1.41 ms per run)
whisper_print_timings: encode time = 36698.69 ms / 6 runs ( 6116.45 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: batchd time = 29657.34 ms / 3033 runs ( 9.78 ms per run)
whisper_print_timings: prompt time = 4671.10 ms / 947 runs ( 4.93 ms per run)
whisper_print_timings: total time = 76409.26 ms

Release 1.4.0:
whisper_print_timings: load time = 528.86 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 1114.53 ms
whisper_print_timings: sample time = 479.16 ms / 569 runs ( 0.84 ms per run)
whisper_print_timings: encode time = 42009.84 ms / 6 runs ( 7001.64 ms per run)
whisper_print_timings: decode time = 20988.81 ms / 569 runs ( 36.89 ms per run)
whisper_print_timings: total time = 65414.71 ms
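To see which stages account for the difference, the two timing reports can be compared stage by stage. The following is a minimal Python sketch, not part of whisper.cpp; the helper names `parse_timings` and `compare` are made up for illustration, and the input strings are simply the total-time figures copied from the two reports above.

```python
import re

# Total time per stage, copied from the two reports in this post.
RUN_1_5_4 = """
load time = 467.40 ms
mel time = 252.86 ms
sample time = 4312.42 ms
encode time = 36698.69 ms
decode time = 0.00 ms
batchd time = 29657.34 ms
prompt time = 4671.10 ms
total time = 76409.26 ms
"""

RUN_1_4_0 = """
load time = 528.86 ms
mel time = 1114.53 ms
sample time = 479.16 ms
encode time = 42009.84 ms
decode time = 20988.81 ms
total time = 65414.71 ms
"""

# Matches "<stage> time = <number> ms"; also works on the raw
# "whisper_print_timings: ..." lines, ignoring the per-run figures.
TIMING = re.compile(r"(\w+) time\s*=\s*([\d.]+) ms")


def parse_timings(text):
    """Hypothetical helper: map stage name -> total milliseconds."""
    return {stage: float(ms) for stage, ms in TIMING.findall(text)}


def compare(old, new):
    """Hypothetical helper: per-stage delta (new - old) in ms.

    A stage missing from one report is treated as 0 ms.
    """
    stages = sorted(set(old) | set(new))
    return {s: round(new.get(s, 0.0) - old.get(s, 0.0), 2) for s in stages}


delta = compare(parse_timings(RUN_1_4_0), parse_timings(RUN_1_5_4))

if __name__ == "__main__":
    for stage, ms in delta.items():
        print(f"{stage:>7}: {ms:+10.2f} ms")
```

With these inputs the `total` delta comes out at roughly +10995 ms, which is the 11 s quoted above. The encode stage is actually faster in 1.5.4, so the extra time sits in the sampling and decoding stages (`sample`, `batchd`, `prompt`).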

knitvoger Mar 15 '24 12:03

Use -bs 1 to get the old speed. In general, the quality should be better with more beams, but it's possible that you won't observe much of a difference.

ggerganov Mar 15 '24 13:03

Thanks @ggerganov, setting -bs to 1 did run faster. But I found that word timing has some regression, especially for the last word of each sentence. The end times of these words always fall on whole seconds, with the milliseconds all zero, as with the words "sir?", "in.", and "morning." in the result below. Can this be solved?

[00:00:13.000 --> 00:00:13.230] May
[00:00:13.230 --> 00:00:13.300] I
[00:00:13.300 --> 00:00:13.600] come
[00:00:13.600 --> 00:00:13.750] in
[00:00:13.750 --> 00:00:14.000] sir?
[00:00:14.000 --> 00:00:15.110] Yeah,
[00:00:15.110 --> 00:00:15.460] please
[00:00:15.460 --> 00:00:15.690] come
[00:00:15.690 --> 00:00:16.000] in.
[00:00:16.000 --> 00:00:16.570] Good
[00:00:16.570 --> 00:00:17.560] morning
[00:00:17.560 --> 00:00:18.000] sir.
[00:00:18.000 --> 00:00:18.280] Good
[00:00:18.280 --> 00:00:19.000] morning.
[00:00:19.000 --> 00:00:19.220] Please
[00:00:19.220 --> 00:00:19.360] take
[00:00:19.360 --> 00:00:19.490] your
[00:00:19.490 --> 00:00:19.710] seat,
[00:00:19.710 --> 00:00:20.000] jeeveni.
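To make the pattern easier to check on longer transcripts, here is a small Python sketch that lists every word whose end time falls exactly on a whole second. It is not part of whisper.cpp; the function name `whole_second_ends` is made up for illustration, and it assumes the `[start --> end] word` text layout shown above.

```python
import re

# The word-level output from this comment, one entry after another.
OUTPUT = """
[00:00:13.000 --> 00:00:13.230] May [00:00:13.230 --> 00:00:13.300] I
[00:00:13.300 --> 00:00:13.600] come [00:00:13.600 --> 00:00:13.750] in
[00:00:13.750 --> 00:00:14.000] sir? [00:00:14.000 --> 00:00:15.110] Yeah,
[00:00:15.110 --> 00:00:15.460] please [00:00:15.460 --> 00:00:15.690] come
[00:00:15.690 --> 00:00:16.000] in. [00:00:16.000 --> 00:00:16.570] Good
[00:00:16.570 --> 00:00:17.560] morning [00:00:17.560 --> 00:00:18.000] sir.
[00:00:18.000 --> 00:00:18.280] Good [00:00:18.280 --> 00:00:19.000] morning.
[00:00:19.000 --> 00:00:19.220] Please [00:00:19.220 --> 00:00:19.360] take
[00:00:19.360 --> 00:00:19.490] your [00:00:19.490 --> 00:00:19.710] seat,
[00:00:19.710 --> 00:00:20.000] jeeveni.
"""

# One entry: "[hh:mm:ss.mmm --> hh:mm:ss.mmm] word"
TIMESTAMP = r"\d{2}:\d{2}:\d{2}\.\d{3}"
ENTRY = re.compile(rf"\[({TIMESTAMP}) --> ({TIMESTAMP})\]\s+(\S+)")


def whole_second_ends(text):
    """Hypothetical helper: (word, end) pairs whose end time has .000 ms."""
    return [
        (word, end)
        for _start, end, word in ENTRY.findall(text)
        if end.endswith(".000")
    ]


suspicious = whole_second_ends(OUTPUT)

if __name__ == "__main__":
    for word, end in suspicious:
        print(f"{end}  {word}")
```

On the output above this flags "sir?", "in.", "sir.", "morning." and "jeeveni.", i.e. only the last word of each sentence.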

knitvoger Mar 18 '24 10:03