
Add NVIDIA cuBLAS support

Open slaren opened this issue 1 year ago • 2 comments

Adds support for NVIDIA cuBLAS for batched operations. In my system this is significantly faster than OpenBLAS.

Build with LLAMA_CUBLAS:

make clean && LLAMA_CUBLAS=1 make

Perplexity seconds per pass (i9 9900k, RTX 3080 10GB)

             7B q4_0   7B f16   7B f32
cuBLAS          8.92     5.24     7.70
OpenBLAS       22.64    24.85    18.18
No BLAS        26.39    30.35    54.33

Still missing CMake and Windows support; any contributions are welcome.

slaren avatar Apr 18 '23 16:04 slaren
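For quick reference, the seconds-per-pass numbers above work out to the following speedups of cuBLAS over OpenBLAS (a small sketch; the values are copied from the table):

```python
# Seconds per pass from the table above (i9 9900k, RTX 3080 10GB)
openblas = {"7B q4_0": 22.64, "7B f16": 24.85, "7B f32": 18.18}
cublas   = {"7B q4_0":  8.92, "7B f16":  5.24, "7B f32":  7.70}

for model in openblas:
    speedup = openblas[model] / cublas[model]
    print(f"{model}: {speedup:.2f}x faster than OpenBLAS")
```

So roughly 2.4–2.5x for the quantized and f32 cases, and close to 4.7x for f16.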

I would bring up CLBlast, as it's been implemented over at https://github.com/LostRuins/koboldcpp/ and isn't Nvidia-exclusive, but in my experience the speed-ups are minor, or it just ends up being slower than OpenBLAS in cases where the dGPU isn't that good or the CPU is just better. The speed-up here with cuBLAS seems much more pronounced.

rabidcopy avatar Apr 18 '23 16:04 rabidcopy

I haven't completed a full run yet, but with 7B q4_0, the perplexity of the first iterations is identical to OpenBLAS. It will probably be higher in f16xf32 because instead of converting to f32xf32, I convert to f16xf16.

slaren avatar Apr 18 '23 17:04 slaren

Perplexity with 7B q4_0 is 6.2838

./perplexity -m models/7B/ggml-model-q4_0.bin -f wikitext-2-raw/wiki.test.raw -t 8

main: seed = 1681837585
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

perplexity : calculating perplexity over 655 chunks, batch_size=512
9.13 seconds per pass - ETA 1.66 hours
[1]4.3798,[2]4.9554,[3]5.8269,[4]6.4695,[5]6.5438, (intermediate chunks elided) [651]6.2858,[652]6.2865,[653]6.2908,[654]6.2844,[655]6.2838,

llama_print_timings:        load time =   11045.83 ms
llama_print_timings:      sample time =       0.00 ms /      1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 5755570.69 ms / 335360 tokens (   17.16 ms per token)
llama_print_timings:        eval time =       0.00 ms /      1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 5793144.42 ms

slaren avatar Apr 18 '23 20:04 slaren
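As a sanity check, the figures in the log above are internally consistent (a small sketch; the constants are copied from the log):

```python
# Numbers reported by the perplexity log above
chunks, batch_size = 655, 512
seconds_per_pass = 9.13
prompt_eval_ms = 5755570.69

# ETA = chunks * seconds per pass
eta_hours = chunks * seconds_per_pass / 3600
print(f"ETA: {eta_hours:.2f} hours")      # log reports 1.66 hours

# Total tokens = chunks * batch size
tokens = chunks * batch_size
print(tokens)                             # log reports 335360 tokens

# Average prompt eval cost per token
print(f"{prompt_eval_ms / tokens:.2f} ms per token")  # log reports 17.16
```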

Is FindCUDAToolkit a good reason to bump the CMake version to 3.17?

slaren avatar Apr 18 '23 20:04 slaren

> Perplexity with 7B q4_0 is 6.2838

This is the expected value.

> Is FindCUDAToolkit a good reason to bump the CMake version to 3.17?

Yes

ggerganov avatar Apr 18 '23 20:04 ggerganov

Tested successfully under Windows. Build with cmake .. -DLLAMA_CUBLAS=ON. The CUDA Toolkit is available from https://developer.nvidia.com/cuda-downloads.

Though I would appreciate a review on the cmake changes, I have no idea how any of that works.

slaren avatar Apr 18 '23 21:04 slaren

> > Perplexity with 7B q4_0 is 6.2838
>
> This is the expected value
>
> > Is FindCUDAToolkit a good reason to bump the CMake version to 3.17?
>
> Yes

Hmm, cmake on Ubuntu 20.04 ships 3.16 by default, but even the GitHub Actions runner uses 3.26.

Green-Sky avatar Apr 18 '23 22:04 Green-Sky

Is it possible to make the CMake version depend on LLAMA_CUBLAS ?

ggerganov avatar Apr 18 '23 22:04 ggerganov

> Is it possible to make the CMake version depend on LLAMA_CUBLAS ?

cmake_minimum_required() looks like a function you can call anywhere. @slaren can you try just calling it again with a higher version inside the conditional?

Green-Sky avatar Apr 18 '23 22:04 Green-Sky
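A minimal sketch of that suggestion (untested, and assuming the project's existing top-level minimum stays as-is; the option name matches the build flag discussed in this thread):

```cmake
# Top-level minimum stays low so non-CUDA builds keep working
# on older distros (e.g. Ubuntu 20.04 ships CMake 3.16).
cmake_minimum_required(VERSION 3.12)
project(llama)

option(LLAMA_CUBLAS "llama: use cuBLAS" OFF)

if (LLAMA_CUBLAS)
    # FindCUDAToolkit was added in CMake 3.17, so raise the
    # requirement only when the cuBLAS build is requested.
    cmake_minimum_required(VERSION 3.17)
    find_package(CUDAToolkit)
endif()
```

With this, a plain `cmake .` still configures on 3.16, while `cmake -DLLAMA_CUBLAS=ON .` fails early with a clear version error, as shown in the next comment.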

That seems to work, updated.

slaren avatar Apr 18 '23 22:04 slaren

> That seems to work, updated.

$ cmake .
CMake Error at CMakeLists.txt:147 (cmake_minimum_required):
  CMake 3.17 or higher is required.  You are running version 3.16.3

yup, perfect

Green-Sky avatar Apr 19 '23 00:04 Green-Sky

Very exciting. Can't wait to try it out 🤩

KyTiXo avatar Apr 19 '23 02:04 KyTiXo

Just wondering for all those who have tried, how much speedup do you get in the batched prompt eval timings vs openblas (not perplexity calculations)? Would be good to benchmark against a fixed context size, say 1024 tokens.

LostRuins avatar Apr 19 '23 09:04 LostRuins

> I would bring up CLBlast, as it's been implemented over at https://github.com/LostRuins/koboldcpp/ and isn't Nvidia-exclusive, but in my experience the speed-ups are minor, or it just ends up being slower than OpenBLAS in cases where the dGPU isn't that good or the CPU is just better. The speed-up here with cuBLAS seems much more pronounced.

@rabidcopy our newest CLBlast implementation does the dequantization on the GPU as well, which actually provides much better speeds, since a major bottleneck was actually transferring the data on and off the GPU after the mat mul. That's why I am curious about how fast this might compare.

LostRuins avatar Apr 19 '23 09:04 LostRuins

> > I would bring up CLBlast, as it's been implemented over at https://github.com/LostRuins/koboldcpp/ and isn't Nvidia-exclusive, but in my experience the speed-ups are minor, or it just ends up being slower than OpenBLAS in cases where the dGPU isn't that good or the CPU is just better. The speed-up here with cuBLAS seems much more pronounced.
>
> @rabidcopy our newest CLBlast implementation does the dequantization on the GPU as well, which actually provides much better speeds, since a major bottleneck was actually transferring the data on and off the GPU after the mat mul. That's why I am curious about how fast this might compare.

Found a comparison someone did between llama.cpp with cuBLAS and koboldcpp with CLBlast. Maybe it would be worth implementing CLBlast over here as well? (Sorry, wasn't aware there were further improvements to CLBlast in koboldcpp since I last compared on my own hardware.)

make clean && LLAMA_OPENBLAS=1 make -j && ./main --no-mmap -t 8 -b 512 -m ./models/llama-13b-ggml-q4_0.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt


llama_print_timings:        load time = 27152.17 ms
llama_print_timings:      sample time =    23.20 ms /    50 runs   (    0.46 ms per run)
llama_print_timings: prompt eval time = 25333.24 ms /   399 tokens (   63.49 ms per token)
llama_print_timings:        eval time = 10619.50 ms /    49 runs   (  216.72 ms per run)
llama_print_timings:       total time = 37795.51 ms
make clean && LLAMA_CUBLAS=1 make -j && ./main --no-mmap -t 8 -b 512 -m ./models/llama-13b-ggml-q4_0.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt


llama_print_timings:        load time = 12408.19 ms
llama_print_timings:      sample time =    22.31 ms /    50 runs   (    0.45 ms per run)
llama_print_timings: prompt eval time = 10300.15 ms /   399 tokens (   25.81 ms per token)
llama_print_timings:        eval time = 10533.55 ms /    49 runs   (  214.97 ms per run)
llama_print_timings:       total time = 22964.58 ms
make clean && LLAMA_CLBLAST=1 make -j main && ./main --no-mmap -t 8 -b 512 -m ./models/llama-13b-ggml-q4_0.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt


llama_print_timings:        load time = 13699.05 ms
llama_print_timings:      sample time =    22.91 ms /    50 runs   (    0.46 ms per run)
llama_print_timings: prompt eval time = 11899.14 ms /   399 tokens (   29.82 ms per token)
llama_print_timings:        eval time = 10496.48 ms /    49 runs   (  214.21 ms per run)
llama_print_timings:       total time = 24218.98 ms

rabidcopy avatar Apr 19 '23 15:04 rabidcopy
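In per-token terms, the prompt eval numbers in that comparison come out as follows (a small sketch; the ms-per-token values are copied from the timings above):

```python
# Prompt eval cost in ms per token from the 13B q4_0 comparison above
openblas_ms = 63.49
cublas_ms = 25.81
clblast_ms = 29.82

print(f"cuBLAS:  {openblas_ms / cublas_ms:.2f}x faster than OpenBLAS")
print(f"CLBlast: {openblas_ms / clblast_ms:.2f}x faster than OpenBLAS")
```

So on that machine, cuBLAS prompt ingestion is about 2.5x OpenBLAS and CLBlast about 2.1x, while single-token eval time is essentially unchanged across all three builds.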

@LostRuins I have a thread going on in the discussions where people are trying out the koboldcpp CLBlast implementation. On my integrated Intel HD530, CLBlast prompt ingestion was twice as slow as OpenBLAS, but someone with an Nvidia 3060 reported a 50% improvement on their end.

ghost avatar Apr 19 '23 16:04 ghost

Here are benchmarks for my system

Note: This is with the non-quantized 13B-16bit model

  • cpu ryzen 7900x
  • gpu 1080ti
  • ram 64GiB@5200

With cublas

make clean && LLAMA_CUBLAS=1 make -j && ./main --mlock -t 8 -b 512 -m ./models/13B/ggml-model-f16.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt

llama_print_timings:        load time = 20691.75 ms
llama_print_timings:      sample time =    16.89 ms /    50 runs   (    0.34 ms per run)
llama_print_timings: prompt eval time = 18748.63 ms /   373 tokens (   50.26 ms per token)
llama_print_timings:        eval time = 24565.83 ms /    49 runs   (  501.34 ms per run)
llama_print_timings:       total time = 45275.08 ms

With OpenBLAS

make clean && LLAMA_OPENBLAS=1 make -j && ./main --mlock -t 8 -b 512 -m ./models/13B/ggml-model-f16.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt

llama_print_timings:        load time = 43043.43 ms
llama_print_timings:      sample time =    17.31 ms /    50 runs   (    0.35 ms per run)
llama_print_timings: prompt eval time = 27472.01 ms /   373 tokens (   73.65 ms per token)
llama_print_timings:        eval time = 24480.05 ms /    49 runs   (  499.59 ms per run)
llama_print_timings:       total time = 67541.45 ms

So that's a ~48% total time speedup, super nice!

Azeirah avatar Apr 19 '23 21:04 Azeirah
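Working the total times above through (a quick sketch; the two totals are copied from the timings):

```python
# Total times from the 13B f16 runs above
openblas_total_ms = 67541.45
cublas_total_ms = 45275.08

speedup_pct = (openblas_total_ms / cublas_total_ms - 1) * 100
print(f"{speedup_pct:.1f}% faster end to end")  # prints 49.2% faster end to end
```

roughly the ~48% quoted; most of the gain is in load time and prompt eval, since per-token eval time is nearly identical between the two builds.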

cc @ravenscroftj Might be interested in adding cuBLAS support to turbopilot to speed up prompt processing. This change works with low-VRAM cards even for big models and is optionally enabled with the GGML_USE_CUBLAS compile flag:

https://github.com/ggerganov/llama.cpp/blob/master/Makefile#L107-L115

Will be available in the ggml repo soon as well

ggerganov avatar Apr 22 '23 09:04 ggerganov

oh that is awesome thanks for the tag @ggerganov - will definitely be looking at adding this as making suggestions much faster will make turbopilot much more usable!

ravenscroftj avatar Apr 22 '23 15:04 ravenscroftj