
Llama.cpp GPU Offloading Issue - Unexpected Switch to CPU

Open MontassarTn opened this issue 10 months ago • 7 comments

I'm reaching out to the community for some assistance with an issue I'm encountering in llama.cpp. Previously, the program was successfully utilizing the GPU for execution. However, recently, it seems to have switched to CPU execution.

Observations:

- BLAS = 1 is set, indicating the use of BLAS routines (likely for linear algebra operations).
- llm_load_print_meta: LF token = 13 '<0x0A>' (output related to loading metadata; its specific meaning may be context-dependent).
- llm_load_tensors: ggml ctx size = 0.11 MiB (the global memory context is relatively small).
- llm_load_tensors: offloading 0 repeating layers to GPU (no repeating layers are being offloaded to the GPU).
- llm_load_tensors: offloaded 0/33 layers to GPU (no layers have been offloaded to the GPU).
- llm_load_tensors: CPU buffer size = 7338.64 MiB (a significant amount of data is being loaded into CPU buffers).

Code:

llm = LlamaCpp(
    model_path=model_path,
    n_ctx=2048,
    max_tokens=250,
    verbose=True,
    n_gpu_layer=-1,
    n_batch=512,
    temperature=0.3,
    top_p=0.95,
    top_k=40,
    min_p=0.05,
)

Questions:

- Has anyone else encountered a similar situation with llama.cpp switching from GPU to CPU execution?
- Are there any known configuration changes or environmental factors that might be causing this behavior?
- Could there be specific conditions in my code that are preventing GPU offloading?

MontassarTn avatar Apr 18 '24 13:04 MontassarTn

n_gpu_layer= -1,

This isn't a thing; you've set your GPU to use no layers. Increase the number.
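Also worth checking: BLAS = 1 in your log only confirms a BLAS backend (e.g. OpenBLAS) was compiled in, not that the build can offload to the GPU at all. A minimal sketch to test that, assuming your llama-cpp-python version exposes the low-level llama_supports_gpu_offload binding (it wraps the llama.cpp C function of the same name):

import llama_cpp

# Ask the underlying llama.cpp build whether GPU offload is supported.
# False means the wheel was compiled without CUDA/Metal/etc., so no
# n_gpu_layers value can move layers off the CPU.
print(llama_cpp.llama_supports_gpu_offload())

If that prints False, the usual fix is to reinstall with GPU build flags, e.g. CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python (the flag name current around the time of this thread).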

Jeximo avatar Apr 18 '24 13:04 Jeximo

@Jeximo n_gpu_layers = -1 # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.

MontassarTn avatar Apr 18 '24 13:04 MontassarTn

If you don't know how many layers there are, you can use -1 to move all to GPU.

That's not the case in the llama.cpp C API.

slaren avatar Apr 18 '24 14:04 slaren

Even when I try with 30, it's still the same issue.

MontassarTn avatar Apr 18 '24 14:04 MontassarTn

Even when I try with 30, it's still the same issue.

https://github.com/ggerganov/llama.cpp/blob/4e9a7f7f7fb6acbddd1462909c8d696e38edbfcc/examples/main/README.md?plain=1#L318

The original post has a typo in the parameter; the correct option is --n-gpu-layers N. Did you use --n-gpu-layers 30, and it still said "offloaded 0/33 layers to GPU"?
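In the Python wrapper the keyword is spelled n_gpu_layers (with a trailing s). A corrected sketch of the original call, assuming the wrapper silently ignores unknown keyword names rather than raising an error:

llm = LlamaCpp(
    model_path=model_path,
    n_ctx=2048,
    max_tokens=250,
    verbose=True,
    n_gpu_layers=30,  # correctly spelled; the misspelled kwarg may be dropped silently
    n_batch=512,
    temperature=0.3,
    top_p=0.95,
    top_k=40,
    min_p=0.05,
)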

Jeximo avatar Apr 18 '24 14:04 Jeximo

Yes, still "offloaded 0/33 layers to GPU" :(

[image]

I used LlamaCpp from langchain.

MontassarTn avatar Apr 18 '24 14:04 MontassarTn

I used Llama cpp from langchain

I see. All I can say for sure is that the langchain wrapper is not passing the parameter as expected, and your image shows -1 instead of 30.

Maybe try llama.cpp without a wrapper.
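Even dropping just the langchain layer narrows it down. A minimal sketch with llama-cpp-python directly (the model path is a placeholder):

from llama_cpp import Llama

# Construct the model without langchain and watch the verbose log:
# a working GPU build should report something like
# "offloaded 30/33 layers to GPU" in the llm_load_tensors lines.
llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder path
    n_gpu_layers=30,
    verbose=True,
)

If even this shows 0/33 offloaded, run the ./main binary from this repo with --n-gpu-layers 30 to rule out the Python bindings entirely.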

Jeximo avatar Apr 18 '24 19:04 Jeximo

Hi @MontassarTn, I'm facing the same issue. Did you find a workaround for this?

whoami02 avatar Jun 14 '24 09:06 whoami02