Llama.cpp GPU Offloading Issue - Unexpected Switch to CPU
I'm reaching out to the community for help with an issue I'm encountering in llama.cpp. Previously, the program was successfully running on the GPU, but recently it seems to have switched to CPU execution.
Observations:
BLAS = 1 is set, indicating the build is using BLAS routines for linear algebra. The model load log shows:

llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 7338.64 MiB

In other words, no layers are being offloaded to the GPU, and the entire model (7338.64 MiB) is loaded into CPU buffers.
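For reference, one quick way to check whether the installed llama-cpp-python build supports GPU offload at all (a minimal sketch, assuming a recent version that exposes the llama_supports_gpu_offload binding from the C API):

import llama_cpp

# False here usually means a CPU-only wheel: compiled without CUDA/Metal,
# so no value of n_gpu_layers can move layers to the GPU.
print("llama-cpp-python version:", llama_cpp.__version__)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())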
Code:
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=2048,
    max_tokens=250,
    verbose=True,
    n_gpu_layer=-1,
    n_batch=512,
    temperature=0.3,
    top_p=0.95,
    top_k=40,
    min_p=0.05,
)
Questions:
Has anyone else encountered llama.cpp switching from GPU to CPU execution like this?
Are there any known configuration changes or environmental factors that might cause this behavior?
Could something in my code be preventing GPU offloading?
n_gpu_layer= -1,
This isn't a thing; you've set your GPU to use no layers. Increase the number.
@Jeximo n_gpu_layers = -1 # The number of layers to put on the GPU. The rest will be on the CPU. If you don't know how many layers there are, you can use -1 to move all to GPU.
That's not the case in the llama.cpp C API.
Even when I try with 30, it's still the same issue.
https://github.com/ggerganov/llama.cpp/blob/4e9a7f7f7fb6acbddd1462909c8d696e38edbfcc/examples/main/README.md?plain=1#L318
The original post has a typo in the parameter; per the README it's --n-gpu-layers N. Did you use --n-gpu-layers 30, and it still said "offloaded 0/33 layers to GPU"?
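In the LangChain wrapper, the equivalent keyword is n_gpu_layers (plural). A minimal sketch of the corrected call, assuming langchain_community's LlamaCpp (min_p is omitted here, since not every wrapper version exposes it):

from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path=model_path,  # same GGUF path as in the original post
    n_ctx=2048,
    max_tokens=250,
    n_gpu_layers=-1,        # note the plural spelling
    n_batch=512,
    temperature=0.3,
    top_p=0.95,
    top_k=40,
    verbose=True,           # load log should then report "offloaded 33/33 layers to GPU"
)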
Yes, it still says "offloaded 0/33 layers to GPU" :(
I used LlamaCpp from LangChain.
I see. All I can say for sure is that the LangChain wrapper is not passing the parameter as expected, and your image shows -1 instead of 30.
Maybe try llama.cpp without a wrapper.
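For example, a minimal sketch using the llama-cpp-python package directly (model path and prompt here are placeholders):

from llama_cpp import Llama

# Loading the model directly makes the offload log unambiguous:
# with a GPU-enabled build you should see "offloaded 33/33 layers to GPU".
llm = Llama(
    model_path="model.gguf",  # placeholder: point this at your GGUF file
    n_ctx=2048,
    n_gpu_layers=-1,          # -1 offloads all layers in llama-cpp-python
    n_batch=512,
    verbose=True,
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])

If this still reports 0/33, the problem is the build itself (a CPU-only wheel) rather than the LangChain parameters.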
Hi @MontassarTn, I'm facing the same issue. Did you find any workaround for this?