
Try to add AVX 512-bit support

ggerganov opened this pull request 1 year ago • 16 comments

Update:

This PR adds AVX 512-bit support. The performance compared to AVX2 is worse: either I am not utilising the 512-bit instruction set correctly, or it simply does not provide any benefit for this type of computation. I'll leave this as a draft PR in case other people are interested in giving it a try, but for now I am not going to merge it.
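
For context: the hot loops in ggml are essentially dot products over f16 weights and f32 activations (note f16 = 1 in the model-load logs below). Here is a minimal sketch of what a 512-bit variant of such a kernel can look like, assuming AVX-512F and a length that is a multiple of 16; the function name is hypothetical and this is not the PR's actual code:

#include <immintrin.h>
#include <stdint.h>

// Hypothetical sketch of an f16 x f32 dot product using AVX-512F.
// Assumes n is a multiple of 16; remainder handling omitted.
static float dot_f16_f32_avx512(const uint16_t * x, const float * y, int n) {
    __m512 acc = _mm512_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        // Load 16 half-precision weights and widen them to f32.
        __m512 xv = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)(x + i)));
        __m512 yv = _mm512_loadu_ps(y + i);
        // acc += x*y with a single fused multiply-add.
        acc = _mm512_fmadd_ps(xv, yv, acc);
    }
    // Horizontal sum of the 16 accumulator lanes.
    return _mm512_reduce_add_ps(acc);
}

The AVX2 path processes 8 lanes per _mm256_fmadd_ps instead of 16, so on paper the 512-bit loop halves the instruction count; the timings in this thread show that this does not translate into wall-clock gains here.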


OUTDATED BELOW

Work in progress

This is not tested because I don't have an AVX 512-bit CPU, so it is very likely that the code will fail. Still, I would appreciate it if someone gave it a try and reported any issues.

@ArtyomZemlyak Are you interested in giving it a try? I noticed you have a CPU with AVX 512-bit support.

ggerganov avatar Oct 26 '22 15:10 ggerganov

Hi @ggerganov, I gave it a try out of curiosity on my i7-1165G7 on Ubuntu 22.04. It does not work, but unfortunately there isn't much to report: the script just runs forever. Let me know if there is a way to provide verbose logs.

avx512 branch:

➜ make
cc  -I.              -O3 -std=c11   -pthread -mavx512f -mavx512dq -mfma -mf16c   -c ggml.c
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main 

❯ ./main -v -l fr -m ../whisper.cpp/models/ggml-medium.bin -f ../whisper.cpp/testfile-16b.wav
whisper_model_load: loading model from '../whisper.cpp/models/ggml-medium.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 4
whisper_model_load: mem_required  = 2610.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 1644.98 MB
whisper_model_load: memory size =   182.62 MB 
whisper_model_load: model size  =  1462.12 MB

main: processing '../whisper.cpp/testfile-16b.wav' (1243847 samples, 77.7 sec), 4 threads, lang = fr, task = transcribe, timestamps = 1 ...

master branch for reference:

➜ make
cc  -I.              -O3 -std=c11   -pthread -mavx -mavx2 -mfma -mf16c   -c ggml.c
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main 

➜ ./main -l fr -m models/ggml-medium.bin -f testfile-16b.wav 
whisper_model_load: loading model from 'models/ggml-medium.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 4
whisper_model_load: mem_required  = 2610.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 1644.98 MB
whisper_model_load: memory size =   182.62 MB 
whisper_model_load: model size  =  1462.12 MB

main: processing 'testfile-16b.wav' (1243847 samples, 77.7 sec), 4 threads, lang = fr, task = transcribe, timestamps = 1 ...

whisper_print_timings:     load time =  1196.27 ms
whisper_print_timings:      mel time =   712.04 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 133803.97 ms / 5575.17 ms per layer
whisper_print_timings:   decode time = 28689.79 ms / 1195.41 ms per layer
whisper_print_timings:    total time = 164578.70 ms

system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

And my cpuinfo:

➜ cat /proc/cpuinfo | grep avx512
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear ibt flush_l1d arch_capabilities
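
The flags line confirms avx512f and avx512dq, matching the -mavx512f -mavx512dq flags in the build above. For a quick runtime sanity check from C, GCC and Clang provide a CPUID-backed builtin; this is a generic probe, not necessarily how ggml.c computes its system_info line:

#include <stdio.h>

// Minimal runtime ISA probe using a GCC/Clang builtin (queries CPUID).
// Standalone illustration only; ggml's own detection may differ.
int main(void) {
    printf("AVX2    = %d\n", __builtin_cpu_supports("avx2"));
    printf("AVX512F = %d\n", __builtin_cpu_supports("avx512f"));
    return 0;
}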

lerela avatar Oct 26 '22 21:10 lerela

Yes, it's interesting! Compiled. Tried bench (tiny, -t 8):

whisper_model_load: loading model from '../models/ggml-model-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  84.99 MB
whisper_model_load: memory size =    11.41 MB 
whisper_model_load: model size  =    73.54 MB

whisper_print_timings:     load time =   118.11 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  3359.54 ms / 839.89 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =  3477.65 ms

system_info: n_threads = 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

ArtyomZemlyak avatar Oct 27 '22 07:10 ArtyomZemlyak

Seems it's much slower for all models.

ArtyomZemlyak avatar Oct 27 '22 07:10 ArtyomZemlyak

3 runs of tiny -t 8 on the AVX512 and master (AVX2) branches: [image]

ArtyomZemlyak avatar Oct 27 '22 07:10 ArtyomZemlyak

CPU info (all):

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect flush_l1d arch_capabilities

ArtyomZemlyak avatar Oct 27 '22 07:10 ArtyomZemlyak

@lerela @ArtyomZemlyak Just pushed another version - I'm doing this blindly, so not sure if it works

ggerganov avatar Oct 27 '22 14:10 ggerganov

Still not working. I tried with the tiny model: it does terminate (I wasn't patient enough yesterday) and it's faster than before, but still behind master, and there is no output (it just prints the stats but no text).

There is a lot of variance between runs, but here are some timings:

avx512:

whisper_print_timings:     load time =   253.09 ms
whisper_print_timings:      mel time =   710.95 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 16384.33 ms / 4096.08 ms per layer
whisper_print_timings:   decode time = 35269.12 ms / 8817.28 ms per layer
whisper_print_timings:    total time = 54139.26 ms

master:

whisper_print_timings:     load time =   240.68 ms
whisper_print_timings:      mel time =  1357.53 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 11729.50 ms / 2932.38 ms per layer
whisper_print_timings:   decode time =  8131.67 ms / 2032.92 ms per layer
whisper_print_timings:    total time = 21719.72 ms

Called with ./main -l fr -pc -m models/ggml-tiny.bin -f testfile.wav.

lerela avatar Oct 27 '22 14:10 lerela
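
A classic way a blind 256-bit to 512-bit port can produce plausible timings but wrong or empty text is a horizontal reduction that still assumes 8 lanes. Purely as an illustration of that pitfall, not a diagnosis of this PR, a correct 16-lane reduction might look like the sketch below (assuming AVX-512F plus AVX-512DQ for the 256-bit extract, both enabled by the build flags above):

#include <immintrin.h>

// Reduce all 16 f32 lanes of a __m512 accumulator to a scalar.
// Illustrative sketch; a sum over only the low 8 lanes would give
// wrong results while the timings still look normal.
static inline float hsum_f32_16(__m512 v) {
    // Fold 512 -> 256 bits (AVX-512DQ extract).
    __m256 r8 = _mm256_add_ps(_mm512_castps512_ps256(v),
                              _mm512_extractf32x8_ps(v, 1));
    // Fold 256 -> 128 bits.
    __m128 r4 = _mm_add_ps(_mm256_castps256_ps128(r8),
                           _mm256_extractf128_ps(r8, 1));
    // Fold 128 bits down to a single float.
    r4 = _mm_add_ps(r4, _mm_movehl_ps(r4, r4));
    r4 = _mm_add_ss(r4, _mm_movehdup_ps(r4));
    return _mm_cvtss_f32(r4);
}

In practice the single intrinsic _mm512_reduce_add_ps expands to a similar fold sequence and is less error-prone.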

Tested on my Xeon(R) Silver 4210R CPU @ 2.40GHz (VM with 8 cores only).

avx512 branch:

jay.binks@tools2:~/src/whisper.cpp$ ./main -v -l fr -m ../whisper.cpp/models/ggml-small.en.bin -f ../whisper.cpp/samples/jfk.wav 
whisper_model_load: loading model from '../whisper.cpp/models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1048.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 533.05 MB
whisper_model_load: memory size =    68.48 MB 
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: WARNING: model is not multilingual, ignoring language and translation options
main: processing '../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...



whisper_print_timings:     load time =   762.59 ms
whisper_print_timings:      mel time =   154.38 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 10756.24 ms / 896.35 ms per layer
whisper_print_timings:   decode time = 10072.97 ms / 839.41 ms per layer
whisper_print_timings:    total time = 21918.74 ms

master:

jay.binks@tools2:~/src/whisper.cpp$ ./main -v -l fr -m ../whisper.cpp/models/ggml-small.en.bin -f ../whisper.cpp/samples/jfk.wav 
whisper_model_load: loading model from '../whisper.cpp/models/ggml-small.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1044.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 8 | AVX2 = 1 | AVX512 = 1 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: WARNING: model is not multilingual, ignoring language and translation options
main: processing '../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.000]   And so, my fellow Americans, ask not what your country can do for you.
[00:00:08.000 --> 00:00:11.000]   Ask what you can do for your country.


whisper_print_timings:     load time =   750.77 ms
whisper_print_timings:      mel time =   140.73 ms
whisper_print_timings:   sample time =    23.58 ms
whisper_print_timings:   encode time = 11561.66 ms / 963.47 ms per layer
whisper_print_timings:   decode time =  1224.77 ms / 102.06 ms per layer
whisper_print_timings:    total time = 13703.20 ms

jaybinks avatar Nov 05 '22 11:11 jaybinks

@jaybinks Just pushed another fix

ggerganov avatar Nov 05 '22 20:11 ggerganov

I'll test it soon...

What timezone are you in? I'm in GMT+10, Australia. Happy to jump on a call and test with you if you like.


jaybinks avatar Nov 05 '22 23:11 jaybinks

I have re-tested, and it seems to be no better (possibly worse): ./bench -m ./models/ggml-small.en.bin -t 4

Before pulling your recent change:

whisper_print_timings:     load time =  1096.42 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 11793.09 ms / 982.76 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 12890.37 ms

After commit c4350356deafcc748b64f6aece9d4de2cf223de5:

whisper_print_timings:     load time =   785.93 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 13540.00 ms / 1128.33 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 14325.94 ms

I'm not convinced it's worse; some runs were "total time 12 sec". It's just not heaps better.

jaybinks avatar Nov 06 '22 05:11 jaybinks

I found a machine with an AVX-512 CPU and fixed the code. It now produces correct results, but the performance compared to AVX2 is worse: either I am not utilising the 512-bit instruction set correctly, or it simply does not provide any benefit for this type of computation. I'll leave this as a draft PR in case other people are interested in giving it a try, but for now I am not going to merge it.

ggerganov avatar Nov 06 '22 07:11 ggerganov

Hey, do you have any plans to add GPU support to whisper.cpp?

Just curious what your plans are.


jaybinks avatar Nov 06 '22 07:11 jaybinks

Adding GPU support is not out of the question, but it's low priority atm. Here are some additional thoughts on this: https://github.com/ggerganov/whisper.cpp/discussions/126

ggerganov avatar Nov 06 '22 15:11 ggerganov