Add NVIDIA cuBLAS support
Adds support for NVIDIA cuBLAS for batched operations. On my system this is significantly faster than OpenBLAS.
Build with `LLAMA_CUBLAS`:

```sh
make clean && LLAMA_CUBLAS=1 make
```
Perplexity seconds per pass (i9-9900K, RTX 3080 10GB):

|          | 7B q4_0 | 7B f16 | 7B f32 |
|----------|---------|--------|--------|
| cuBLAS   | 8.92    | 5.24   | 7.70   |
| OpenBLAS | 22.64   | 24.85  | 18.18  |
| No BLAS  | 26.39   | 30.35  | 54.33  |
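For context, numbers like these come from the bundled `perplexity` tool; a typical invocation looks roughly like the sketch below (the model and dataset paths are illustrative, the wikitext-2 test file being the usual input for these runs):

```sh
# Illustrative only; adjust paths to your setup
./perplexity -m ./models/7B/ggml-model-q4_0.bin -f ./wikitext-2-raw/wiki.test.raw
```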
Still missing CMake and Windows support; any contributions are welcome.
I would bring up CLBlast, as it's been implemented over at https://github.com/LostRuins/koboldcpp/ and isn't Nvidia-exclusive, but in my experience the speedups are minor, or it just ends up being slower than OpenBLAS in cases where the dGPU isn't that good or the CPU is simply better. The speedup here with cuBLAS seems much more pronounced.
I haven't completed a full run yet, but with 7B q4_0 the perplexity of the first iterations is identical to OpenBLAS. It will probably be higher for the f16×f32 mat muls because instead of converting them to f32×f32, I convert them to f16×f16.
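To make the precision trade-off concrete, here is a minimal sketch of an f16×f16 GEMM with f32 accumulation through `cublasGemmEx`. This is illustrative only, not the code from this PR; it assumes CUDA 11+, column-major buffers already resident on the device, and a hypothetical helper name:

```c
// Minimal sketch (NOT the PR's implementation): multiply two f16 matrices
// with f32 accumulation via cublasGemmEx. d_A (m x k), d_B (k x n) and
// d_C (m x n) are assumed to be device pointers, column-major.
#include <cuda_fp16.h>
#include <cublas_v2.h>

static void f16_gemm_sketch(cublasHandle_t handle,
                            const half *d_A, const half *d_B, float *d_C,
                            int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    // Inputs are rounded to half precision, which is where a small
    // perplexity difference vs an f32xf32 multiply can come from;
    // accumulation stays in f32 (CUBLAS_COMPUTE_32F).
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, m,   // lda = m
                 d_B, CUDA_R_16F, k,   // ldb = k
                 &beta,
                 d_C, CUDA_R_32F, m,   // ldc = m
                 CUBLAS_COMPUTE_32F,
                 CUBLAS_GEMM_DEFAULT);
}
```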
Perplexity with 7B q4_0 is 6.2838
```
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks, batch_size=512
9.13 seconds per pass - ETA 1.66 hours
[1]4.3798,[2]4.9554,[3]5.8269,[4]6.4695,[5]6.5438,[6]6.5414,[7]6.7175,[8]6.8070,[9]7.1756,[10]7.4121,[11]7.6567,[12]7.6957,[13]7.6057,[14]7.6821,[15]7.9367,[16]7.5419,[17]7.4189,[18]7.3798,[19]7.0077,[20]6.9947,[21]6.8969,[22]6.7124,[23]6.6743,[24]6.5868,[25]6.5871,[26]6.4149,[27]6.2349,[28]6.1341,[29]6.0499,[30]5.8939,[31]5.8660,[32]5.8840,[33]5.8190,[34]5.8538,[35]5.8796,[36]5.9233,[37]5.9272,[38]5.9444,[39]5.9825,[40]6.0413,[41]6.0483,[42]6.0827,[43]6.0398,[44]6.0945,[45]6.0989,[46]6.0730,[47]6.0968,[48]6.0675,[49]6.0746,[50]6.0352,[51]6.0311,[52]6.0201,[53]6.0642,[54]6.0477,[55]6.0251,[56]6.0595,[57]6.0826,[58]6.1044,[59]6.1183,[60]6.1648,[61]6.1537,[62]6.2167,[63]6.2503,[64]6.2654,[65]6.3119,[66]6.3221,[67]6.3402,[68]6.3542,[69]6.3791,[70]6.4114,[71]6.4328,[72]6.4626,[73]6.5278,[74]6.5331,[75]6.5475,[76]6.5638,[77]6.5771,[78]6.5619,[79]6.5915,[80]6.5840,[81]6.5968,[82]6.6005,[83]6.5468,[84]6.5323,[85]6.5209,[86]6.4998,[87]6.4344,[88]6.4060,[89]6.3854,[90]6.3688,[91]6.3949,[92]6.3910,[93]6.3936,[94]6.3911,[95]6.4199,[96]6.4178,[97]6.4106,[98]6.4036,[99]6.3896,[100]6.3896,[101]6.4155,[102]6.4092,[103]6.4309,[104]6.4377,[105]6.4362,[106]6.4539,[107]6.4526,[108]6.4649,[109]6.4596,[110]6.4551,[111]6.4780,[112]6.4970,[113]6.4984,[114]6.4950,[115]6.5033,[116]6.4959,[117]6.5014,[118]6.5299,[119]6.5508,[120]6.5872,[121]6.6035,[122]6.6283,[123]6.6673,[124]6.6850,[125]6.6763,[126]6.7154,[127]6.7524,[128]6.7799,[129]6.7630,[130]6.7725,[131]6.7673,[132]6.7585,[133]6.7457,[134]6.7569,[135]6.7534,[136]6.7402,[137]6.7322,[138]6.7151,[139]6.7035,[140]6.7005,[141]6.6707,[142]6.6659,[143]6.6380,[144]6.6179,[145]6.6092,[146]6.5957,[147]6.6032,[148]6.6055,[149]6.5994,[150]6.5953,[151]6.5965,[152]6.5870,[153]6.5703,[154]6.5613,[155]6.5681,[156]6.5630,[157]6.5814,[158]6.5849,[159]6.5891,[160]6.5917,[161]6.6041,[162]6.5739,[163]6.5619,[164]6.5357,[165]6.5039,[166]6.4751,[167]6.4378,[168]6.4051,[169]6.3916,[170]6.3791,[171]6.3503,[172]6.3322,[173]6.3136,[174]6.2829,[175]6.2608,[176]6.2505,[177]6.2295,[178]6.2059,[179]6.1888,[180]6.1798,[181]6.1574,[182]6.1382,[183]6.1240,[184]6.1238,[185]6.1165,[186]6.1183,[187]6.1237,[188]6.1200,[189]6.1384,[190]6.1393,[191]6.1597,[192]6.1761,[193]6.1938,[194]6.2055,[195]6.2264,[196]6.2434,[197]6.2655,[198]6.2811,[199]6.2840,[200]6.2886,[201]6.2844,[202]6.3049,[203]6.3116,[204]6.3115,[205]6.3224,[206]6.3302,[207]6.3262,[208]6.3347,[209]6.3399,[210]6.3450,[211]6.3547,[212]6.3621,[213]6.3727,[214]6.3763,[215]6.3803,[216]6.3951,[217]6.4130,[218]6.4265,[219]6.4267,[220]6.4231,[221]6.4169,[222]6.4133,[223]6.4025,[224]6.3958,[225]6.3911,[226]6.4126,[227]6.4213,[228]6.4271,[229]6.4338,[230]6.4294,[231]6.4463,[232]6.4332,[233]6.4161,[234]6.4004,[235]6.3846,[236]6.3768,[237]6.3664,[238]6.3698,[239]6.3536,[240]6.3433,[241]6.3466,[242]6.3504,[243]6.3488,[244]6.3369,[245]6.3343,[246]6.3221,[247]6.3098,[248]6.3030,[249]6.3010,[250]6.3057,[251]6.2981,[252]6.2947,[253]6.2845,[254]6.2804,[255]6.2688,[256]6.2497,[257]6.2386,[258]6.2299,[259]6.2279,[260]6.2197,[261]6.2154,[262]6.2095,[263]6.2050,[264]6.1858,[265]6.1850,[266]6.1835,[267]6.1766,[268]6.1863,[269]6.1843,[270]6.1850,[271]6.1928,[272]6.1974,[273]6.1969,[274]6.1984,[275]6.2073,[276]6.2128,[277]6.2289,[278]6.2397,[279]6.2483,[280]6.2519,[281]6.2617,[282]6.2678,[283]6.2825,[284]6.2903,[285]6.2997,[286]6.3144,[287]6.3138,[288]6.3199,[289]6.3107,[290]6.2956,[291]6.2802,[292]6.2644,[293]6.2505,[294]6.2530,[295]6.2524,[296]6.2567,[297]6.2554,[298]6.2579,[299]6.2551,[300]6.2439,[301]6.2440,[302]6.2360,[303]6.2283,[304]6.2204,[305]6.2180,[306]6.2048,[307]6.2072,[308]6.2104,[309]6.1941,[310]6.1880,[311]6.1816,[312]6.1839,[313]6.1782,[314]6.1770,[315]6.1604,[316]6.1562,[317]6.1395,[318]6.1179,[319]6.1298,[320]6.1429,[321]6.1466,[322]6.1422,[323]6.1356,[324]6.1331,[325]6.1431,[326]6.1430,[327]6.1451,[328]6.1494,[329]6.1554,[330]6.1579,[331]6.1703,[332]6.1672,[333]6.1741,[334]6.1682,[335]6.1618,[336]6.1655,[337]6.1625,[338]6.1612,[339]6.1555,[340]6.1512,[341]6.1589,[342]6.1614,[343]6.1669,[344]6.1668,[345]6.1667,[346]6.1638,[347]6.1686,[348]6.1728,[349]6.1746,[350]6.1712,[351]6.1717,[352]6.1717,[353]6.1665,[354]6.1664,[355]6.1719,[356]6.1749,[357]6.1712,[358]6.1802,[359]6.1833,[360]6.1795,[361]6.1791,[362]6.1858,[363]6.1970,[364]6.2035,[365]6.2093,[366]6.2100,[367]6.2188,[368]6.2166,[369]6.2175,[370]6.2185,[371]6.2125,[372]6.2178,[373]6.2234,[374]6.2221,[375]6.2217,[376]6.2301,[377]6.2252,[378]6.2278,[379]6.2338,[380]6.2254,[381]6.2211,[382]6.2154,[383]6.2144,[384]6.2137,[385]6.2124,[386]6.2119,[387]6.2111,[388]6.2066,[389]6.2012,[390]6.1943,[391]6.1862,[392]6.1822,[393]6.1803,[394]6.1828,[395]6.1812,[396]6.1738,[397]6.1814,[398]6.1852,[399]6.1935,[400]6.1931,[401]6.1945,[402]6.1950,[403]6.1969,[404]6.2032,[405]6.1937,[406]6.1903,[407]6.1895,[408]6.1905,[409]6.2029,[410]6.2139,[411]6.2264,[412]6.2427,[413]6.2542,[414]6.2618,[415]6.2670,[416]6.2750,[417]6.2881,[418]6.2916,[419]6.2990,[420]6.3077,[421]6.3197,[422]6.3255,[423]6.3326,[424]6.3446,[425]6.3537,[426]6.3602,[427]6.3647,[428]6.3730,[429]6.3775,[430]6.3865,[431]6.4011,[432]6.4054,[433]6.4041,[434]6.3995,[435]6.4002,[436]6.4027,[437]6.4121,[438]6.4200,[439]6.4164,[440]6.4159,[441]6.4108,[442]6.4099,[443]6.4112,[444]6.4115,[445]6.4095,[446]6.4118,[447]6.4147,[448]6.4191,[449]6.4164,[450]6.4167,[451]6.4124,[452]6.4006,[453]6.3922,[454]6.3862,[455]6.3869,[456]6.3917,[457]6.3934,[458]6.3912,[459]6.3922,[460]6.4009,[461]6.3981,[462]6.3965,[463]6.4016,[464]6.4007,[465]6.3976,[466]6.3895,[467]6.3898,[468]6.3897,[469]6.3919,[470]6.3924,[471]6.3876,[472]6.3923,[473]6.3866,[474]6.3880,[475]6.3821,[476]6.3844,[477]6.3773,[478]6.3764,[479]6.3827,[480]6.3879,[481]6.3899,[482]6.3854,[483]6.3813,[484]6.3835,[485]6.3818,[486]6.3763,[487]6.3763,[488]6.3744,[489]6.3694,[490]6.3667,[491]6.3637,[492]6.3579,[493]6.3549,[494]6.3531,[495]6.3528,[496]6.3493,[497]6.3440,[498]6.3422,[499]6.3372,[500]6.3275,[501]6.3206,[502]6.3204,[503]6.3202,[504]6.3109,[505]6.3134,[506]6.3143,[507]6.3081,[508]6.3038,[509]6.3027,[510]6.3067,[511]6.3113,[512]6.3148,[513]6.3166,[514]6.3233,[515]6.3177,[516]6.3169,[517]6.3180,[518]6.3181,[519]6.3211,[520]6.3238,[521]6.3255,[522]6.3284,[523]6.3294,[524]6.3357,[525]6.3394,[526]6.3406,[527]6.3426,[528]6.3372,[529]6.3377,[530]6.3329,[531]6.3319,[532]6.3368,[533]6.3391,[534]6.3372,[535]6.3395,[536]6.3341,[537]6.3318,[538]6.3366,[539]6.3378,[540]6.3418,[541]6.3426,[542]6.3433,[543]6.3447,[544]6.3459,[545]6.3437,[546]6.3444,[547]6.3399,[548]6.3344,[549]6.3345,[550]6.3318,[551]6.3280,[552]6.3260,[553]6.3217,[554]6.3195,[555]6.3166,[556]6.3163,[557]6.3186,[558]6.3147,[559]6.3142,[560]6.3137,[561]6.3139,[562]6.3120,[563]6.3120,[564]6.3164,[565]6.3181,[566]6.3178,[567]6.3155,[568]6.3161,[569]6.3144,[570]6.3170,[571]6.3176,[572]6.3186,[573]6.3188,[574]6.3151,[575]6.3147,[576]6.3146,[577]6.3135,[578]6.3114,[579]6.3122,[580]6.3056,[581]6.3018,[582]6.3009,[583]6.3016,[584]6.3020,[585]6.2943,[586]6.2875,[587]6.2878,[588]6.2928,[589]6.2985,[590]6.3016,[591]6.3037,[592]6.3022,[593]6.2985,[594]6.2996,[595]6.2973,[596]6.3011,[597]6.2987,[598]6.2949,[599]6.2971,[600]6.2969,[601]6.2954,[602]6.2972,[603]6.3001,[604]6.3012,[605]6.3044,[606]6.3065,[607]6.3048,[608]6.3013,[609]6.3019,[610]6.3056,[611]6.3038,[612]6.3063,[613]6.3026,[614]6.2975,[615]6.2898,[616]6.2928,[617]6.2865,[618]6.2814,[619]6.2757,[620]6.2615,[621]6.2543,[622]6.2525,[623]6.2540,[624]6.2545,[625]6.2544,[626]6.2529,[627]6.2550,[628]6.2555,[629]6.2553,[630]6.2587,[631]6.2650,[632]6.2704,[633]6.2687,[634]6.2721,[635]6.2726,[636]6.2694,[637]6.2659,[638]6.2686,[639]6.2657,[640]6.2667,[641]6.2669,[642]6.2738,[643]6.2760,[644]6.2772,[645]6.2751,[646]6.2793,[647]6.2755,[648]6.2762,[649]6.2763,[650]6.2801,[651]6.2858,[652]6.2865,[653]6.2908,[654]6.2844,[655]6.2838,
llama_print_timings:        load time = 11045.83 ms
llama_print_timings:      sample time = 0.00 ms / 1 runs (0.00 ms per run)
llama_print_timings: prompt eval time = 5755570.69 ms / 335360 tokens (17.16 ms per token)
llama_print_timings:        eval time = 0.00 ms / 1 runs (0.00 ms per run)
llama_print_timings:       total time = 5793144.42 ms
```
Is FindCUDAToolkit a good reason to bump the CMake version to 3.17?
> Perplexity with 7B q4_0 is 6.2838

This is the expected value.

> Is FindCUDAToolkit a good reason to bump the CMake version to 3.17?

Yes
Tested successfully under Windows. Build with `cmake .. -DLLAMA_CUBLAS=ON`. The CUDA Toolkit is available from https://developer.nvidia.com/cuda-downloads.
Though I would appreciate a review on the CMake changes, I have no idea how any of that works.
Hmm, cmake on Ubuntu 20.04 ships 3.16 by default, but even the GitHub Actions runner uses 3.26.
Is it possible to make the CMake version depend on `LLAMA_CUBLAS`?
> Is it possible to make the CMake version depend on `LLAMA_CUBLAS`?
The `cmake_minimum_required()` call looks like a function you could call anywhere. @slaren can you try just calling it again with a higher number in the conditional?
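Something along these lines; a sketch of the suggested conditional bump (the `find_package` wiring below is an illustrative assumption, not necessarily the change as committed):

```cmake
# Sketch only: raise the minimum CMake version on the cuBLAS path, since
# the FindCUDAToolkit module first shipped in CMake 3.17.
if (LLAMA_CUBLAS)
    cmake_minimum_required(VERSION 3.17)

    find_package(CUDAToolkit)
    if (CUDAToolkit_FOUND)
        add_compile_definitions(GGML_USE_CUBLAS)
        set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} CUDA::cudart CUDA::cublas)
    else()
        message(WARNING "cuBLAS not found")
    endif()
endif()
```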
That seems to work, updated.
```
$ cmake .
CMake Error at CMakeLists.txt:147 (cmake_minimum_required):
  CMake 3.17 or higher is required. You are running version 3.16.3
```
yup, perfect
Very exciting. Can't wait to try it out 🤩
Just wondering, for all those who have tried: how much speedup do you get in the batched prompt eval timings vs OpenBLAS (not perplexity calculations)? Would be good to benchmark against a fixed context size, say 1024 tokens.
> I would bring up CLBlast, as it's been implemented over at https://github.com/LostRuins/koboldcpp/ and isn't Nvidia-exclusive, but in my experience the speedups are minor, or it just ends up being slower than OpenBLAS in cases where the dGPU isn't that good or the CPU is simply better. The speedup here with cuBLAS seems much more pronounced.

@rabidcopy our newest CLBlast implementation does the dequantization on the GPU as well, which actually provides much better speeds, since a major bottleneck was actually transferring the data on and off the GPU after the mat mul. That's why I am curious about how fast this might compare.
Found a comparison someone did between llama.cpp with cuBLAS and koboldcpp with CLBlast. Maybe it would be worth implementing CLBlast over here as well? (Sorry, I wasn't aware there were further improvements to CLBlast in koboldcpp since I last compared on my own hardware.)
```
make clean && LLAMA_OPENBLAS=1 make -j && ./main --no-mmap -t 8 -b 512 -m ./models/llama-13b-ggml-q4_0.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt

llama_print_timings:        load time = 27152.17 ms
llama_print_timings:      sample time = 23.20 ms / 50 runs (0.46 ms per run)
llama_print_timings: prompt eval time = 25333.24 ms / 399 tokens (63.49 ms per token)
llama_print_timings:        eval time = 10619.50 ms / 49 runs (216.72 ms per run)
llama_print_timings:       total time = 37795.51 ms
```

```
make clean && LLAMA_CUBLAS=1 make -j && ./main --no-mmap -t 8 -b 512 -m ./models/llama-13b-ggml-q4_0.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt

llama_print_timings:        load time = 12408.19 ms
llama_print_timings:      sample time = 22.31 ms / 50 runs (0.45 ms per run)
llama_print_timings: prompt eval time = 10300.15 ms / 399 tokens (25.81 ms per token)
llama_print_timings:        eval time = 10533.55 ms / 49 runs (214.97 ms per run)
llama_print_timings:       total time = 22964.58 ms
```

```
make clean && LLAMA_CLBLAST=1 make -j main && ./main --no-mmap -t 8 -b 512 -m ./models/llama-13b-ggml-q4_0.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt

llama_print_timings:        load time = 13699.05 ms
llama_print_timings:      sample time = 22.91 ms / 50 runs (0.46 ms per run)
llama_print_timings: prompt eval time = 11899.14 ms / 399 tokens (29.82 ms per token)
llama_print_timings:        eval time = 10496.48 ms / 49 runs (214.21 ms per run)
llama_print_timings:       total time = 24218.98 ms
```
@LostRuins I have a thread going on in the discussions where people are trying out the Kobold CLBlast implementation. On my integrated Intel HD 530, CLBlast prompt ingestion was twice as slow as OpenBLAS, but someone with an Nvidia 3060 reported a 50% improvement on his end.
Here are benchmarks for my system.
Note: this is with the non-quantized 13B 16-bit model.
- CPU: Ryzen 7900X
- GPU: 1080 Ti
- RAM: 64 GiB @ 5200
With cuBLAS:

```
make clean && LLAMA_CUBLAS=1 make -j && ./main --mlock -t 8 -b 512 -m ./models/13B/ggml-model-f16.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt

llama_print_timings:        load time = 20691.75 ms
llama_print_timings:      sample time = 16.89 ms / 50 runs (0.34 ms per run)
llama_print_timings: prompt eval time = 18748.63 ms / 373 tokens (50.26 ms per token)
llama_print_timings:        eval time = 24565.83 ms / 49 runs (501.34 ms per run)
llama_print_timings:       total time = 45275.08 ms
```
With OpenBLAS:

```
make clean && LLAMA_OPENBLAS=1 make -j && ./main --mlock -t 8 -b 512 -m ./models/13B/ggml-model-f16.bin -c 1024 -n 50 -s 4201488 -f ./prompts/prompt.txt

llama_print_timings:        load time = 43043.43 ms
llama_print_timings:      sample time = 17.31 ms / 50 runs (0.35 ms per run)
llama_print_timings: prompt eval time = 27472.01 ms / 373 tokens (73.65 ms per token)
llama_print_timings:        eval time = 24480.05 ms / 49 runs (499.59 ms per run)
llama_print_timings:       total time = 67541.45 ms
```
So that's a ~48% total time speedup, super nice!
cc @ravenscroftj
Might be interested in adding cuBLAS support to turbopilot to speed up prompt processing. This change works with low-VRAM cards even for big models and is optionally enabled with the `GGML_USE_CUBLAS` compile flag:
https://github.com/ggerganov/llama.cpp/blob/master/Makefile#L107-L115
It will be available in the `ggml` repo soon as well.
Oh that is awesome, thanks for the tag @ggerganov! Will definitely be looking at adding this, as making suggestions much faster will make turbopilot much more usable!