
Idea: OpenCL via CLBlast?

Open misutoneko opened this issue 3 years ago • 6 comments

Hi,

Nice project! Thanks for your work. I wish I had better hw to make use of it :D

I haven't seen anyone mention CLBlast here yet. It actually provides a wrapper so that CLBlast can be used as a drop-in replacement for OpenBLAS. I've now tried this; it works, and it was easy enough even for me :D But to get the best performance you'd need to tune it a lot more, I guess.

Here's my naïve patch in case anyone wants to play with it: whisper.cpp_CLBlast.patch.gz

Note that the patch simply replaces the existing OpenBLAS implementation. Also, CLBlast needs to be compiled with -DNETLIB=ON to enable the wrapper.
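
For anyone who wants to try the patch, a rough build sketch follows. The CLBlast side uses its own CMake option; `WHISPER_OPENBLAS=1` is the flag the whisper.cpp Makefile of this era used for its BLAS build, which the patch redirects to CLBlast. Treat the exact steps as an assumption, not gospel:

```sh
# Build and install CLBlast with the Netlib/CBLAS-compatible wrapper enabled.
git clone https://github.com/CNugteren/CLBlast
cd CLBlast && mkdir build && cd build
cmake -DNETLIB=ON ..        # -DNETLIB=ON enables the drop-in BLAS wrapper
make && sudo make install

# Back in whisper.cpp, with the patch applied, build the BLAS-enabled binary.
cd /path/to/whisper.cpp
WHISPER_OPENBLAS=1 make
```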

misutoneko avatar Nov 23 '22 12:11 misutoneko

Interesting. It was recently mentioned here by @StuartIanNaylor: https://github.com/ggerganov/whisper.cpp/issues/89#issuecomment-1323634188

What hardware are you running this on, if I may ask? I ask because I wonder whether you have proper Vulkan GPU support that could make use of https://github.com/kpet/clvk, as CLBlast appears to be supported on it: https://github.com/kpet/clvk/blob/main/docs/supported-applications.md

(I'm also wondering what the status is of the Mesa3D Vulkan Broadcom driver for the RPi4, and if it's usable, whether the above might help those systems as well. Most likely again more of a proof of concept, but it would be cool to have some sort of GPU-enabled deep learning on the RPi4 to play with.)

j1nx avatar Nov 23 '22 13:11 j1nx

I have a Rock5b with a Mali G610, which as an SBC & SoC is a bit bleeding edge, since Mesa/Panfrost support stops at the Mali G57 (Valhall v9, OpenGL ES / OpenGL 3.1).

I have been doing some testing with the G610; currently it's only supported by a Rockchip blob, but using the OpenCL drivers with https://github.com/StuartIanNaylor/rock5b-wav2letter-bench works, and for ML it's about equivalent to the CPU. That repo is just the ArmNN examples with some fixes, but you need to point OpenCL at your OpenGL driver (I've forgotten the .so name for the VC6). Also, I think the ArmNN example doesn't use GpuAcc on the Pi, not because it can't work but because the GPU isn't a Mali, so it may just need an OpenCL ML driver; that is just a guess, though, and the Pi has been OpenCL compliant for a while, hasn't it?

If it isn't, you can probably run the tests in https://github.com/KhronosGroup/OpenCL-CTS if ArmNN is a fail.
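
(Before running either test suite, it's worth confirming the OpenCL ICD loader can actually see the driver. A quick sanity check, assuming the stock clinfo utility is installed:)

```sh
# List the OpenCL platforms/devices the ICD loader can see.
# If the Mali (or VideoCore) device isn't listed here, ArmNN and
# CLBlast won't find it either.
clinfo | grep -E "Platform Name|Device Name"
```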

But yeah, it might be really interesting.

https://cnugteren.github.io/clblast/clblast.html

StuartIanNaylor avatar Nov 23 '22 13:11 StuartIanNaylor

If I build with -DNETLIB=ON and then apply your patch, performance is terrible. But if I check:

cat /sys/devices/platform/fb000000.gpu/utilisation
7

7% load is all I am getting, 15% at max, so I guess this needs someone with much better knowledge than me. I don't think simply substituting CLBlast for OpenBLAS will work anyway, as that is not the point: it takes someone with knowledge of the model and of where parallelism can take place, using both CLBlast & OpenBLAS, so that we are working on CPU & GPU at the same time rather than merely substituting one for the other. I know from ArmNN that even with my current bad driver the G610 MP4 & CPU are approximately equivalent.

If I scaled my GPU from the 7% load up to 100%, that's x14; divide the times by 14 and yeah, it's about CPU-equivalent. So even if I got this installed correctly and tuned, I would only ever match the CPU. But that isn't the point: the point is to split the work out across GPU/CPU in threads so the computation is working in parallel...

That is far beyond my ability, but if clear parallelism exists, then with the G610 x2 is possible; well, a bit less due to inefficiencies, and because the GPU code would likely, as with ArmNN, put about 7% load on the CPU and steal that much from it. I am thinking that because a transformer has a clear partition into encoder & decoder, or can even be split by layers as the TFLite delegate can do, maybe it's possible, but it's far out of my realm.
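
(For anyone reproducing this: the 7% figure above is a point-in-time sample. A minimal sketch for watching the load continuously while a run is in progress, using the same sysfs path as above:)

```sh
# Sample the Mali GPU load once per second during a benchmark run.
# The sysfs node is the same one read above; it varies per SoC/board.
watch -n 1 cat /sys/devices/platform/fb000000.gpu/utilisation
```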

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
|  |  | NEON BLAS | tiny | 8 | 236.65 | 35018.07 |
|  |  | NEON BLAS | base | 8 | 335.71 | 67945.20 |
|  |  | NEON BLAS | small | 8 | 641.83 | 263145.69 |

Whilst the normal optimised CPU build gives:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | --- | --- | --- | --- | --- | --- |
| rk3588 | Debian11 | NEON | tiny | 8 | 232.45 | 2768.78 |
| rk3588 | Debian11 | NEON | base | 8 | 308.36 | 6374.82 |
| rk3588 | Debian11 | NEON | small | 8 | 626.23 | 25784.05 |
| rk3588 | Debian11 | NEON | medium | 8 | 1667.23 | 86026.82 |
| rk3588 | Debian11 | NEON | large | 8 | 4307.16 | 161328.59 |

StuartIanNaylor avatar Nov 23 '22 15:11 StuartIanNaylor

Thanks! Yes, you're right, it needs some serious work. In fact the CLBlast docs recommend against using the wrapper because of hampered performance; it's really just a convenience for getting existing code running. But I wanted to get the idea out, since there are folks out there who might be able to do something with it :D

As I mentioned, I'm a bit hw-challenged myself, but sure, Vulkan would also be good if the hw supports it. CLBlast can apparently be used even with OpenCL 1.1-level GPUs, so that'd be part of the charm for anyone stuck with older stuff. I ran the tests with a GTX 660, which I don't think even has 2GB of VRAM, because it can't run the small model without core dumping :( The base.en model seems fine, however.

EDIT: It seems this GPU only has 1.5GB of VRAM, which causes even the base.en model to crash sometimes (often, actually). So yeah, for older GPUs to be feasible, VRAM usage would need to be lowered quite a lot. On the bright side, I've noticed the AVX/AVX2 support really helps a lot! It means I can now use the large model with just the CPU. Fortunately my CPU is a slightly more modern model than the GPU ;)
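
(Two quick Linux checks relevant to the above; both are standard utilities, though nvidia-smi naturally assumes an Nvidia card:)

```sh
# Does the CPU advertise AVX/AVX2? whisper.cpp enables these at build time.
grep -o '\bavx2\?\b' /proc/cpuinfo | sort -u

# How much VRAM does the GPU report? (Nvidia only.)
nvidia-smi --query-gpu=memory.total --format=csv
```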

misutoneko avatar Nov 23 '22 15:11 misutoneko

PS: the capability is there. I looked at my power meter after the test, confused that it was reading nearly 10 watts, and then remembered that on my other screen, which had power-saved, I was still running the streaming version of whisper :) So unknowingly I did have two instances running at the same time, CPU/GPU.

I am more interested in embedded, and my results are not bad: running whisper takes about 5 watts, whilst the GPU could be as low as 1.5 watts, as it seems to use about 1/3 the power when running similar tasks.

I have forgotten what my RTX 3050 got with the full version of whisper; I bought it because it's only 140 watts, which counts as low in Nvidia's crazy-wattage world. It's amazing what the likes of the M1, and even $150 embedded SBCs such as the RK3588 Rock5b, are providing for the watts they use, and what @ggerganov has running.

StuartIanNaylor avatar Nov 23 '22 16:11 StuartIanNaylor

For me, CLBlast provides a ~12.5% speedup compared to vanilla whisper.cpp

Vanilla whisper:

whisper_print_timings:     fallbacks =  11 p /  20 h
whisper_print_timings:     load time =   184.15 ms
whisper_print_timings:      mel time =  1010.46 ms
whisper_print_timings:   sample time =  2715.31 ms /  2306 runs (    1.18 ms per run)
whisper_print_timings:   encode time = 23960.04 ms /    11 runs ( 2178.19 ms per run)
whisper_print_timings:   decode time = 17336.37 ms /  2266 runs (    7.65 ms per run)
whisper_print_timings:    total time = 45225.82 ms

CLBlast:

whisper_print_timings:     fallbacks =   8 p /  23 h
whisper_print_timings:     load time =    93.86 ms
whisper_print_timings:      mel time =   950.52 ms
whisper_print_timings:   sample time =  2599.31 ms /  2202 runs (    1.18 ms per run)
whisper_print_timings:   encode time = 20387.99 ms /    11 runs ( 1853.45 ms per run)
whisper_print_timings:   decode time = 16615.79 ms /  2170 runs (    7.66 ms per run)
whisper_print_timings:    total time = 40669.39 ms


OpenBLAS:

whisper_print_timings:     fallbacks =  33 p /  77 h
whisper_print_timings:     load time =   132.15 ms
whisper_print_timings:      mel time =   948.88 ms
whisper_print_timings:   sample time =  9583.82 ms /  7918 runs (    1.21 ms per run)
whisper_print_timings:   encode time = 113946.63 ms /    35 runs ( 3255.62 ms per run)
whisper_print_timings:   decode time = 121515.83 ms /  7790 runs (   15.60 ms per run)
whisper_print_timings:    total time = 246182.56 ms

Test configuration:

  • i5-8365U / UHD Graphics 620, Arch Linux
  • whisper.cpp 1.2.0

marmistrz avatar Mar 04 '23 11:03 marmistrz

> For me, CLBlast provides a ~12.5% speedup compared to vanilla whisper.cpp

That's odd. Mine is the opposite: it's two times slower than vanilla. I am on an Intel UHD 630, Windows 11.

ilovefreesw avatar Aug 10 '23 12:08 ilovefreesw

Out of interest, what's the command people are using to get the above stats? I'm looking at various options via CLBlast and would like to be able to provide comparable perf feedback :)

nullr0ute avatar Dec 01 '23 17:12 nullr0ute
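
(For reference: the whisper_print_timings blocks earlier in the thread are what whisper.cpp's main example prints at the end of every transcription, and the Load/Encode tables match the output format of the bundled bench tool. A sketch, with paths as in the 1.x tree; the model and sample file names are placeholders:)

```sh
# Per-stage timings (load/mel/sample/encode/decode) are printed
# automatically at the end of a normal run:
./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8

# The Load/Encode table comes from the bench example:
make bench
./bench -m models/ggml-base.en.bin -t 8
```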