
Generation using GPU offloading is much slower than without

Barafu opened this issue 2 years ago • 3 comments

Recently, generating text with a large preexisting context has become very slow when using GPU offloading. I tried llama-cpp-python versions 0.1.61 and 0.1.57 and get the same behavior with both.

I am running an Oobabooga installation on Windows 11. The machine is an AMD 3950X with 32 GB RAM and a 3070 Ti (8 GB VRAM). When I try to extend an already long text with GPU offloading enabled, I get these numbers:

llama_print_timings:        load time = 36434.52 ms
llama_print_timings:      sample time =    40.21 ms /    65 runs   (    0.62 ms per token)
llama_print_timings: prompt eval time = 146712.89 ms /  1294 tokens (  113.38 ms per token)
llama_print_timings:        eval time = 15896.38 ms /    64 runs   (  248.38 ms per token)
llama_print_timings:       total time = 163419.18 ms
Output generated in 163.82 seconds (0.39 tokens/s, 64 tokens, context 1623, seed 1809972029)
Llama.generate: prefix-match hit

Note the prompt eval time: almost two and a half minutes. But without the offloading I get:

llama_print_timings:        load time =  5647.99 ms
llama_print_timings:      sample time =    66.99 ms /   111 runs   (    0.60 ms per token)
llama_print_timings: prompt eval time = 10762.28 ms /  1245 tokens (    8.64 ms per token)
llama_print_timings:        eval time = 34006.30 ms /   110 runs   (  309.15 ms per token)
llama_print_timings:       total time = 46578.66 ms
Output generated in 46.96 seconds (2.34 tokens/s, 110 tokens, context 1576, seed 874103700)
Llama.generate: prefix-match hit

Those runs use the same model (13B q5_1), the same prompt, and similar context. The only difference is the argument --n-gpu-layers 26 in the first case and none in the second.

I did check the loading logs for lines like

llama_model_load_internal: offloading 16 layers to GPU
llama_model_load_internal: total VRAM used: 4143 MB

to confirm that it loads as I intended.
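
For completeness, stripped of the Oobabooga layer, the two runs reduce to roughly the following in llama-cpp-python (the model path and n_ctx value are placeholders for my local setup, not exact values from it):

from llama_cpp import Llama

# Run 1: offload 26 layers to the GPU (the slow prompt eval case for me)
llm_gpu = Llama(model_path="models/13B.q5_1.bin", n_ctx=2048, n_gpu_layers=26)

# Run 2: no offloading (prompt eval is an order of magnitude faster)
llm_cpu = Llama(model_path="models/13B.q5_1.bin", n_ctx=2048)

# Same long prompt in both cases; only the prompt eval time differs
output = llm_gpu("<long preexisting context>", max_tokens=64)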

Barafu avatar Jun 10 '23 07:06 Barafu

Same here. It's much, much slower with GPU offloading: in my case it's close to 80 ms per token without it, but with offloading it's around 700 ms per token...

RiyanParvez avatar Jun 10 '23 09:06 RiyanParvez

And it takes more time to load the model too.

RiyanParvez avatar Jun 10 '23 09:06 RiyanParvez

Experiencing the exact same issue with the Debug cuBLAS build. For context, the Release build gives buggy output (#1735).

The buggy Release build with the gibberish output is far faster for me, but the properly working Debug build is extremely slow, as observed here.

UPDATE: Using the latest build made both of these issues seemingly go away in the Release build; Debug is still slow and I don't know why. In any case it is far faster now. If anyone else is having trouble, I recommend building Release x64 from source and using the latest version (as of now it is 74a69d2).
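
For reference, by building from source I mean roughly the standard cuBLAS CMake workflow; the exact flags may differ on your machine, so treat this as a sketch rather than the canonical recipe:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release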

TonyWeimmer40 avatar Jun 10 '23 13:06 TonyWeimmer40

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 10 '24 01:04 github-actions[bot]