whisper.cpp
Try to add AVX 512-bit support
Update:
This PR adds AVX 512-bit support. The performance compared to AVX2 is worse. Either I am not correctly utilising the 512-bit instruction set, or it simply does not provide any benefit for this type of computation. I'll leave this draft PR open in case other people are interested in giving it a try, but for now I am not going to merge it.
OUTDATED BELOW
WIP
This is not tested because I don't have an AVX 512-bit CPU, so it is very likely that the code will fail. Still, I would appreciate it if someone gave it a try and reported any issues.
@ArtyomZemlyak Are you interested in giving it a try? I noticed you have a CPU with AVX 512-bit support
Hi @ggerganov, I gave it a try out of curiosity on my i7-1165G7 on Ubuntu 22.04. It does not work, but unfortunately there isn't much to report: the script just runs forever. Let me know if there is a way to provide verbose logs.
avx512 branch:
➜ make
cc -I. -O3 -std=c11 -pthread -mavx512f -mavx512dq -mfma -mf16c -c ggml.c
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main
❯ ./main -v -l fr -m ../whisper.cpp/models/ggml-medium.bin -f ../whisper.cpp/testfile-16b.wav
whisper_model_load: loading model from '../whisper.cpp/models/ggml-medium.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem_required = 2610.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 1644.98 MB
whisper_model_load: memory size = 182.62 MB
whisper_model_load: model size = 1462.12 MB
main: processing '../whisper.cpp/testfile-16b.wav' (1243847 samples, 77.7 sec), 4 threads, lang = fr, task = transcribe, timestamps = 1 ...
master branch for reference:
➜ make
cc -I. -O3 -std=c11 -pthread -mavx -mavx2 -mfma -mf16c -c ggml.c
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main
➜ ./main -l fr -m models/ggml-medium.bin -f testfile-16b.wav
whisper_model_load: loading model from 'models/ggml-medium.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem_required = 2610.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 1644.98 MB
whisper_model_load: memory size = 182.62 MB
whisper_model_load: model size = 1462.12 MB
main: processing 'testfile-16b.wav' (1243847 samples, 77.7 sec), 4 threads, lang = fr, task = transcribe, timestamps = 1 ...
whisper_print_timings: load time = 1196.27 ms
whisper_print_timings: mel time = 712.04 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 133803.97 ms / 5575.17 ms per layer
whisper_print_timings: decode time = 28689.79 ms / 1195.41 ms per layer
whisper_print_timings: total time = 164578.70 ms
system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
And my cpuinfo:
➜ cat /proc/cpuinfo | grep avx512
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear ibt flush_l1d arch_capabilities
Yes, it's interesting! Compiled it and tried bench (tiny, -t 8):
whisper_model_load: loading model from '../models/ggml-model-tiny.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem_required = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 84.99 MB
whisper_model_load: memory size = 11.41 MB
whisper_model_load: model size = 73.54 MB
whisper_print_timings: load time = 118.11 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 3359.54 ms / 839.89 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 3477.65 ms
system_info: n_threads = 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
Seems it's much slower for all models.
3 runs of tiny -t 8 on the avx512 and master (AVX2) branches.
CPU info (all flags):
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect flush_l1d arch_capabilities
@lerela @ArtyomZemlyak Just pushed another version - I'm doing this blindly, so not sure if it works
Still not working. I tried with the tiny model; it does terminate (I wasn't patient enough yesterday) and it's faster than before, but still behind master, and there is no output (it just prints the stats but no text).
There is a lot of variance between runs, but here are some timings:
avx512:
whisper_print_timings: load time = 253.09 ms
whisper_print_timings: mel time = 710.95 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 16384.33 ms / 4096.08 ms per layer
whisper_print_timings: decode time = 35269.12 ms / 8817.28 ms per layer
whisper_print_timings: total time = 54139.26 ms
master:
whisper_print_timings: load time = 240.68 ms
whisper_print_timings: mel time = 1357.53 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 11729.50 ms / 2932.38 ms per layer
whisper_print_timings: decode time = 8131.67 ms / 2032.92 ms per layer
whisper_print_timings: total time = 21719.72 ms
Called with ./main -l fr -pc -m models/ggml-tiny.bin -f testfile.wav.
Tested on my Xeon(R) Silver 4210R CPU @ 2.40GHz (a VM with 8 cores only).
avx512 branch:
jay.binks@tools2:~/src/whisper.cpp$ ./main -v -l fr -m ../whisper.cpp/models/ggml-small.en.bin -f ../whisper.cpp/samples/jfk.wav
whisper_model_load: loading model from '../whisper.cpp/models/ggml-small.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 768
whisper_model_load: n_text_head = 12
whisper_model_load: n_text_layer = 12
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 3
whisper_model_load: mem_required = 1048.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 533.05 MB
whisper_model_load: memory size = 68.48 MB
whisper_model_load: model size = 464.44 MB
system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: WARNING: model is not multilingual, ignoring language and translation options
main: processing '../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...
whisper_print_timings: load time = 762.59 ms
whisper_print_timings: mel time = 154.38 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 10756.24 ms / 896.35 ms per layer
whisper_print_timings: decode time = 10072.97 ms / 839.41 ms per layer
whisper_print_timings: total time = 21918.74 ms
master:
jay.binks@tools2:~/src/whisper.cpp$ ./main -v -l fr -m ../whisper.cpp/models/ggml-small.en.bin -f ../whisper.cpp/samples/jfk.wav
whisper_model_load: loading model from '../whisper.cpp/models/ggml-small.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 768
whisper_model_load: n_text_head = 12
whisper_model_load: n_text_layer = 12
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 3
whisper_model_load: mem_required = 1044.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size = 68.48 MB
whisper_model_load: model size = 464.44 MB
system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: WARNING: model is not multilingual, ignoring language and translation options
main: processing '../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:08.000] And so, my fellow Americans, ask not what your country can do for you.
[00:00:08.000 --> 00:00:11.000] Ask what you can do for your country.
whisper_print_timings: load time = 750.77 ms
whisper_print_timings: mel time = 140.73 ms
whisper_print_timings: sample time = 23.58 ms
whisper_print_timings: encode time = 11561.66 ms / 963.47 ms per layer
whisper_print_timings: decode time = 1224.77 ms / 102.06 ms per layer
whisper_print_timings: total time = 13703.20 ms
@jaybinks Just pushed another fix
I'll test it soon...
What timezone are you in? I'm in GMT+10 (Australia). Happy to jump on a call and test with you if you like.
I have re-tested with ./bench -m ./models/ggml-small.en.bin -t 4, and it seems to be no better (possibly worse).
Before pulling your recent change:
whisper_print_timings: load time = 1096.42 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 11793.09 ms / 982.76 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 12890.37 ms
After commit c4350356deafcc748b64f6aece9d4de2cf223de5:
whisper_print_timings: load time = 785.93 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 13540.00 ms / 1128.33 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 14325.94 ms
I'm not convinced it's worse; some runs were "total time 12 sec". It's just not heaps better.
I found a machine with an AVX-512 CPU and fixed the code. It now produces correct results, but the performance compared to AVX2 is worse. Either I am not correctly utilising the 512-bit instruction set, or it simply does not provide any benefit for this type of computation. I'll leave this draft PR open in case other people are interested in giving it a try, but for now I am not going to merge it.
Hey, do you have any plans to add GPU support to whisper.cpp? Just curious what your plans are.
Adding GPU support is not out of the question, but it's low priority atm. Here are some additional thoughts on this: https://github.com/ggerganov/whisper.cpp/discussions/126