whisper.cpp
Benchmark results
Encoder
Collection of bench results for various platforms and devices. If you want to submit info about your device, simply run the bench tool or the extra/bench-all.sh script and report the results in the comments below.
Suggestions for a better summary of the results are welcome.
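For reference, a minimal sketch of producing one row for the table below (assuming a standard make build and that a model has already been downloaded, e.g. via models/download-ggml-model.sh):
make bench
./bench -m models/ggml-base.en.bin -t 4
To cover all models and thread counts in one go, run ./extra/bench-all.sh instead.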
CPU | OS | Config | Model | Th | Load [ms] | Enc. [ms] | Commit |
---|---|---|---|---|---|---|---|
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | tiny | 8 | 71 | 102 | 206fc93 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | base | 8 | 96 | 220 | 206fc93 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 8 | 233 | 685 | 206fc93 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | medium | 8 | 603 | 1928 | 206fc93 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | large | 8 | 1158 | 3350 | 206fc93 |
--- | |||||||
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 1 | 251 | 2605 | 206fc93 |
MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 4 | 255 | 884 | 206fc93 |
--- | |||||||
Mac Mini M1 | MacOS | NEON BLAS | tiny | 4 | 62 | 194 | fcf515d |
Mac Mini M1 | MacOS | NEON BLAS | base | 4 | 81 | 380 | fcf515d |
Mac Mini M1 | MacOS | NEON BLAS | small | 4 | 204 | 1249 | fcf515d |
Mac Mini M1 | MacOS | NEON BLAS | medium | 4 | 876 | 3980 | fcf515d |
Mac Mini M1 | MacOS | NEON BLAS | large | 4 | 1876 | 7979 | fcf515d |
--- | |||||||
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | tiny | 8 | 107 | 422 | fcf515d |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | base | 8 | 137 | 880 | fcf515d |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | small | 8 | 280 | 2874 | fcf515d |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | medium | 8 | 692 | 9610 | fcf515d |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | large | 8 | 1317 | 16917 | fcf515d |
--- | |||||||
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 BLAS | tiny | 4 | 120 | 780 | fcf515d |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 BLAS | base | 4 | 151 | 1173 | fcf515d |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 BLAS | small | 4 | 289 | 3062 | fcf515d |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 BLAS | medium | 4 | 711 | 9175 | fcf515d |
Ryzen 9 3900X | Ubuntu 20.04 | AVX2 BLAS | large | 4 | 1282 | 16050 | fcf515d |
--- | |||||||
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | tiny | 8 | 135 | 197 | fcf515d |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | base | 8 | 176 | 421 | fcf515d |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | small | 8 | 357 | 1393 | fcf515d |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | medium | 8 | 855 | 4404 | fcf515d |
Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | large | 8 | 1576 | 8118 | fcf515d |
--- | |||||||
Raspberry Pi 4 | | NEON | tiny | 4 | 1436 | 13839 | fcf515d |
Raspberry Pi 4 | | NEON | base | 4 | 1894 | 30552 | fcf515d |
--- | |||||||
iPhone 13 Mini | iOS 16.0 | NEON BLAS | base | 4 | 97 | 1091 | fcf515d |
--- | |||||||
MacBook M1 Pro | Vivaldi | WASM | tiny | 8 | 133 | 3785 | fcf515d |
MacBook M1 Pro | Vivaldi | WASM | base | 8 | 172 | 8253 | fcf515d |
--- | |||||||
MacBook M1 Pro | Chrome | WASM | tiny | 8 | 134 | 3776 | fcf515d |
MacBook M1 Pro | Chrome | WASM | base | 8 | 168 | 8200 | fcf515d |
--- | |||||||
MacBook M1 Pro | Firefox | WASM | tiny | 8 | 137 | 2626 | fcf515d |
MacBook M1 Pro | Firefox | WASM | base | 8 | 183 | 6226 | fcf515d |
memcpy
MacBook M1 Pro
./bench -w 1 -t 1
memcpy: 37.59 GB/s
Ryzen 9 5950X
./bench -w 1 -t 1
memcpy: 16.74 GB/s
ggml_mul_mat
MacBook M1 Pro
./bench -w 2 -t 1
ggml_mul_mat: 64 x 64: F16 330.6 GFLOPS (128 runs) / F32 466.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 737.5 GFLOPS (128 runs) / F32 838.9 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 938.6 GFLOPS (128 runs) / F32 1062.3 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 1312.5 GFLOPS (128 runs) / F32 1835.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 1765.1 GFLOPS (128 runs) / F32 2041.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 1784.3 GFLOPS (104 runs) / F32 1859.2 GFLOPS (109 runs)
ggml_mul_mat: 4096 x 4096: F16 1855.1 GFLOPS ( 14 runs) / F32 1873.3 GFLOPS ( 14 runs)
Ryzen 9 5950X
WHISPER_OPENBLAS=1 make -j bench && ./bench -w 2 -t 1
ggml_mul_mat: 64 x 64: F16 56.3 GFLOPS (128 runs) / F32 70.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 47.8 GFLOPS (128 runs) / F32 67.0 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 185.1 GFLOPS (128 runs) / F32 332.7 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 386.4 GFLOPS (128 runs) / F32 658.6 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 636.2 GFLOPS (128 runs) / F32 1012.0 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 950.9 GFLOPS ( 56 runs) / F32 1296.8 GFLOPS ( 76 runs)
ggml_mul_mat: 4096 x 4096: F16 1168.6 GFLOPS ( 9 runs) / F32 1403.1 GFLOPS ( 11 runs)
Results for Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-4790K | Debian | | tiny.en | 4 | 165 | 808 |
i7-4790K | Debian | | tiny.en | 8 | 165 | 783 |
i7-4790K | Debian | | base.en | 4 | 212 | 1813 |
i7-4790K | Debian | | base.en | 8 | 214 | 1746 |
Results for a Ryzen 5 4500U 6C/6T laptop CPU (I've included just one result for 8 threads, as the Encode time is much higher when threads > CPU cores).
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Ryzen 5 4500U (6C/6T) | openSUSE Leap | | tiny.en | 4 | 170.00 | 829.43 |
Ryzen 5 4500U (6C/6T) | openSUSE Leap | | tiny.en | 6 | 143.03 | 671.74 |
Ryzen 5 4500U (6C/6T) | openSUSE Leap | | base.en | 4 | 305.92 | 2,092.39 |
Ryzen 5 4500U (6C/6T) | openSUSE Leap | | base.en | 6 | 188.05 | 1,495.61 |
Ryzen 5 4500U (6C/6T) | openSUSE Leap | | small.en | 4 | 408.03 | 6,919.31 |
Ryzen 5 4500U (6C/6T) | openSUSE Leap | | small.en | 6 | 359.23 | 6,370.83 |
Ryzen 5 4500U (6C/6T) | openSUSE Leap | | medium.en | 4 | 2,238.11 | 25,863.28 |
Ryzen 5 4500U (6C/6T) | openSUSE Leap | | medium.en | 6 | 1,113.04 | 19,672.63 |
Ryzen 5 4500U (6C/6T) | openSUSE Leap | | medium.en | 8 | 973.65 | 39,619.20 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11800H | WSL2 Ubuntu | AVX2 | tiny | 2 | 164.35 | 1087.61 |
i7-11800H | WSL2 Ubuntu | AVX2 | tiny | 4 | 128.94 | 733.24 |
i7-11800H | WSL2 Ubuntu | AVX2 | tiny | 8 | 137.57 | 619.88 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | tiny | 2 | 143.02 | 1087.15 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | tiny | 4 | 127.60 | 730.57 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | tiny | 8 | 125.62 | 616.27 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 BLAS | tiny | 2 | 132.59 | 1511.38 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 BLAS | tiny | 4 | 132.48 | 1407.49 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 BLAS | tiny | 8 | 133.82 | 1458.27 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11800H | WSL2 Ubuntu | AVX2 | base | 2 | 174.34 | 2533.79 |
i7-11800H | WSL2 Ubuntu | AVX2 | base | 4 | 166.68 | 1830.67 |
i7-11800H | WSL2 Ubuntu | AVX2 | base | 8 | 165.53 | 1478.73 |
i7-11800H | WSL2 Ubuntu | AVX2 | small | 2 | 340.12 | 8714.24 |
i7-11800H | WSL2 Ubuntu | AVX2 | small | 4 | 394.32 | 6021.41 |
i7-11800H | WSL2 Ubuntu | AVX2 | small | 8 | 305.98 | 4828.84 |
i7-11800H | WSL2 Ubuntu | AVX2 | large | 2 | 3205.36 | 57109.10 |
i7-11800H | WSL2 Ubuntu | AVX2 | large | 4 | 2720.25 | 38519.89 |
i7-11800H | WSL2 Ubuntu | AVX2 | large | 8 | 3716.34 | 27739.99 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | large | 2 | 1954.21 | 54966.84 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | large | 4 | 1455.40 | 37320.62 |
i7-11800H | WSL2 Ubuntu | AVX2 AVX512 | large | 8 | 1372.58 | 27937.64 |
This performance is impressive!
M1 Pro | MacOS | | large | 8 | 1973 | 4208
This performance is impressive!
Yes, there is a huge performance boost due to using the built-in BLAS implementation on these devices. I will soon add OpenBLAS support for x86 architectures and see how this compares.
By the way, AVX-512 is not supported on master. I have added initial support here, but I am not sure if it works: https://github.com/ggerganov/whisper.cpp/pull/95
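In the meantime, if you want to try an OpenBLAS build on Linux, a minimal sketch (assuming a Debian/Ubuntu system where the libopenblas-dev package provides OpenBLAS) is the same invocation used for the Ryzen mul_mat numbers above:
sudo apt install libopenblas-dev
WHISPER_OPENBLAS=1 make -j bench
./bench -w 2 -t 1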
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
Intel® Core™ i5-8250U | Win11 Home | AVX2 | Large | 8 | 2226.85 | 61547.61 |
Compiled with MinGW64 GCC 11.3
Valve Jupiter (AMD Custom APU 0405, Zen 2 microarch, 4c8t, 16GB DDR5 @ 5200 MT/s)
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
AMD Custom APU 0405 | SteamOS 3.2 | AVX2 | Base | 8 | 326.32 | 2592.96 |
Compiled with cc (GCC) 11.3.0
The performance gains on jfk.wav since the last test (two weeks or so ago) are extremely impressive: a ~10-20x speedup, from ~40 seconds down to 2-4 seconds.
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
MacBook M1 Max | macOS Ventura | BLAS | small | 1 | 299.09 | 4166.00 |
MacBook M1 Max | macOS Ventura | BLAS | small | 4 | 329.45 | 1304.32 |
MacBook M1 Max | macOS Ventura | BLAS | base | 1 | 139.10 | 1302.17 |
MacBook M1 Max | macOS Ventura | BLAS | base | 4 | 135.96 | 399.45 |
On an AMD EPYC cloud instance with 64 cores / 240 threads, it gets stuck like this when run with 240 threads. I noticed that above a certain number of threads it gets slow, or perhaps the cloud provider is CPU-limiting. Can anyone else with real hardware check if this is the case?
time ./main -m models/ggml-base.en.bin -f elon.wav -t 240
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 240 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ..
So I have tried various numbers of threads with the above-mentioned cloud provider.
I found that anything above 64 threads gets slower, and it remains usable up to 120 threads. Anything above that hangs. It must be that the cloud provider is throttling the free trial, or too many threads actually slow things down.
...
...
processor : 239
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7742 64-Core Processor
stepping : 0
microcode : 0x830104d
cpu MHz : 2245.780
cache size : 512 KB
physical id : 1
siblings : 120
core id : 59
cpu cores : 60
apicid : 247
initial apicid : 247
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 4491.56
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
time ./main -m models/ggml-base.en.bin -f elon.wav -t 64
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 670.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 64 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing 'elon.wav' (34466688 samples, 2154.2 sec), 64 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.960] [MUSIC PLAYING]
[00:00:03.960 --> 00:00:18.240] In life, we've seen within this part of the world
...
...
[00:35:40.320 --> 00:35:41.920] Thank you, and have a great day.
[00:35:41.920 --> 00:35:43.920] [APPLAUSE]
[00:35:43.920 --> 00:35:45.920] [MUSIC PLAYING]
[00:35:45.920 --> 00:35:56.240] [VIDEO PLAYBACK]
whisper_print_timings: load time = 249.61 ms
whisper_print_timings: mel time = 1267.11 ms
whisper_print_timings: sample time = 1718.69 ms
whisper_print_timings: encode time = 63702.25 ms / 10617.04 ms per layer
whisper_print_timings: decode time = 381317.66 ms / 63552.94 ms per layer
whisper_print_timings: total time = 448362.19 ms
real 7m28.411s
user 347m2.230s
sys 22m42.511s
Using 32 threads was faster than 64 threads; I think the 32-thread run took around 7 minutes or so.
Env: Restricted Cloud / Throttled Maybe
CPU: AMD EPYC 7742 64-Core Processor
OS:
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
Linux XXXX 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Compiler:
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
$ ./bench -m ./models/ggml-small.en.bin -t 4
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 768
whisper_model_load: n_text_head = 12
whisper_model_load: n_text_layer = 12
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 3
whisper_model_load: mem_required = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size = 68.48 MB
whisper_model_load: model size = 464.44 MB
system_info: n_threads = 4 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
whisper_print_timings: load time = 515.02 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 6878.32 ms / 573.19 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 7393.42 ms
If you wish, you can submit these results here:
https://github.com/ggerganov/whisper.cpp/issues/89
Please include the following information:
- CPU model
- Operating system
- Compiler
$ ./bench -m ./models/ggml-small.en.bin -t 240
whisper_model_load: loading model from './models/ggml-small.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 768
whisper_model_load: n_text_head = 12
whisper_model_load: n_text_layer = 12
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 3
whisper_model_load: mem_required = 1588.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size = 68.48 MB
whisper_model_load: model size = 464.44 MB
system_info: n_threads = 240 / 240 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
whisper_print_timings: load time = 528.66 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 12898.34 ms / 1074.86 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 13427.03 ms
If you wish, you can submit these results here:
https://github.com/ggerganov/whisper.cpp/issues/89
Please include the following information:
- CPU model
- Operating system
- Compiler
I'll remove the above posts if they're too much clutter.
@trholding Thanks for the results.
You can generate a table with performance results by simply running the extra/bench-all.sh script.
Regarding the threads - yes, it seems that going beyond 8 threads does not help, regardless of how many cores you have. My guess is that the computation is memory-bound, so using more threads does not improve the performance.
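A quick way to check the thread-scaling behaviour on your own machine is to sweep the -t value of the bench tool; a minimal sketch, assuming a base.en model in models/:
for t in 1 2 4 8 16 32; do echo "== $t threads =="; ./bench -m models/ggml-base.en.bin -t $t; done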
Okay, 8 threads max. So for a large file, is there a possibility of splitting the file into chunks, using silences as terminators, dividing the conversion across (total threads or cores)/8 parallel jobs, and still keeping track of timestamps? This could be awesome for batch conversion.
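One way this idea could be prototyped outside whisper.cpp itself is to pre-split the audio with ffmpeg and fan the chunks out over several processes. A rough sketch (chunk length, job count, and file names are assumptions; true silence-based splitting would use ffmpeg's silencedetect filter instead, and per-chunk timestamps would still need to be offset by each chunk's start time):
ffmpeg -i input.wav -f segment -segment_time 300 -c copy chunk_%03d.wav
ls chunk_*.wav | xargs -P 3 -I{} sh -c './main -m models/ggml-base.en.bin -f "{}" -t 8 > "{}.txt"'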
You can generate a table with performance results by simply running the extra/bench-all.sh script.
Oh, I didn't know. I'll update with tables soon and remove my previous comments in a few hours.
You can generate a table with performance results by simply running the extra/bench-all.sh script.
Hey, sorry. That didn't pan out well. I did the benchmark three times, and my account got deleted without notice. I couldn't get the logs as it was a web terminal. On the other hand, I am happy that this happened: I was giving serious thought to purchasing a GPU+CPU plan there, so a performance check of the CPU was equally important. Technically it was probably my fault - I shouldn't have used a reverse shell and done benchmarks on a free trial, but how does one know if a service is really good or all just vapor...
Dell Precision 5560 laptop results:
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-11850H | Ubuntu | AVX2 | tiny | 4 | 115.87 | 538.43 |
i7-11850H | Ubuntu | AVX2 | base | 4 | 145.14 | 1241.84 |
i7-11850H | Ubuntu | AVX2 | small | 4 | 299.30 | 4343.57 |
i7-11850H | Ubuntu | AVX2 | medium | 4 | 760.98 | 15238.31 |
i7-11850H | Ubuntu | AVX2 | large | 4 | 1404.32 | 27476.86 |
i7-11850H | Ubuntu | AVX2 | tiny | 8 | 131.96 | 358.81 |
i7-11850H | Ubuntu | AVX2 | base | 8 | 166.61 | 839.31 |
i7-11850H | Ubuntu | AVX2 | small | 8 | 320.29 | 2854.86 |
i7-11850H | Ubuntu | AVX2 | medium | 8 | 756.20 | 9829.62 |
i7-11850H | Ubuntu | AVX2 | large | 8 | 1382.38 | 19872.81 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i9-9900K | WSL2 Ubuntu (GCC) | AVX2 | tiny.en | 4 | 85.71 | 601.56 |
i9-9900K | WSL2 Ubuntu (GCC) | AVX2 | small.en | 4 | 212.59 | 5146.23 |
i9-9900K | OSX 10.14.1 (hackintosh - GCC) | AVX2 | tiny.en | 4 | 198.17 | 455.12 |
i9-9900K | OSX 10.14.1 (hackintosh - GCC) | AVX2 | base.en | 4 | 272.62 | 909.71 |
i9-9900K | OSX 10.14.1 (hackintosh - GCC) | AVX2 | small.en | 4 | 598.75 | 2968.75 |
Xeon(R) Silver 4210R CPU @ 2.40GHz | Virtual Machine - Debian Stretch (GCC - master branch) | AVX2 avx512f avx512dq avx512cd avx512bw avx512vl | small.en | 4 | 776.56 | 12340.41 |
Xeon(R) Silver 4210R CPU @ 2.40GHz | Virtual Machine - Debian Stretch (GCC - master branch) | AVX2 avx512f avx512dq avx512cd avx512bw avx512vl | tiny.en | 4 | 295.54 | 1710.46 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Tiny | 4 | 124.28 | 656.41 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Tiny | 8 | 123.70 | 696.41 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Base | 4 | 159.91 | 1754.44 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Base | 8 | 164.47 | 1658.55 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Small | 4 | 330.91 | 6161.86 |
i9-11950H | Pop!_OS 22.04 LTS | AVX2 | Small | 8 | 346.22 | 5187.85 |
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-1065G7 | Windows 11 | - | small.en | 4 | 1,314.25 | 294,168.09 |
Compiled with VS 2022
Something is off, right?
Yup - you are missing the AVX2 flag. See if some of the comments in https://github.com/ggerganov/whisper.cpp/issues/5 can help you resolve this.
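For MSVC builds, one thing that may help (an assumption on my part, not a confirmed fix) is to pass /arch:AVX2 explicitly when configuring with CMake, then check that the system_info line reports AVX2 = 1 when running bench or main:
cmake -B build -DCMAKE_C_FLAGS="/arch:AVX2" -DCMAKE_CXX_FLAGS="/arch:AVX2"
cmake --build build --config Release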
OK, the AVX2 flag seems to help :)
CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
---|---|---|---|---|---|---|
i7-1065G7 | Windows 11 | AVX2 | small.en | 4 | 527.59 | 9,648.67 |
Compiled with VS 2022