DOC: The description of `n_gpu_layers` in https://python.langchain.com/docs/integrations/llms/llamacpp#metal is incorrect
Issue with current documentation:
In the Metal section of the https://python.langchain.com/docs/integrations/llms/llamacpp document, the description of `n_gpu_layers` says that setting it to 1 is enough for Metal.
I haven't found the exact reason for that claim, and when I tested it locally, using a larger value of `n_gpu_layers` significantly improved execution speed. I have a complete ipynb file here: https://github.com/169/ai-snippets/blob/main/llama-cpp.ipynb
Here is why I came to this conclusion.
First, run inference with the `main` binary compiled from llama.cpp. You can see that by default, all 33 layers are offloaded to the GPU:
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 4095.07 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
It's also fast: about 33 tokens/s (from `total time = 3760.69 ms / 124 tokens`).
But if I use `n_gpu_layers=1`, only one layer is offloaded to the GPU, and the rest is handled by the CPU:
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/33 layers to GPU
llm_load_tensors: CPU buffer size = 4165.37 MiB
llm_load_tensors: Metal buffer size = 132.51 MiB
Much slower: about 18 tokens/s (from `2435.17 ms / 43 tokens`).
Under the same conditions, changing to `n_gpu_layers=33` has the same effect as using `./main`:
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: Metal buffer size = 4095.07 MiB
The speed also increased to about 31 tokens/s.
So I think that, within the allowed range, the larger the value of `n_gpu_layers`, the faster the inference. There are also posts raising similar doubts: https://www.reddit.com/r/LangChain/comments/18lb4n4/llamacpp_on_mac_n_gpu_layers_n_batch/
Since I'm still a bit confused, I opened a PR to remove this part of the description.
@genewoo I see you added this part; do you have any other context showing that `n_gpu_layers=1` is a best practice?
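For reference, a minimal sketch of how a larger `n_gpu_layers` value could be passed to LangChain's `LlamaCpp` wrapper. The model path here is a hypothetical placeholder, and `n_gpu_layers=-1` is llama.cpp's convention for offloading all layers:

```python
# Settings sketch; "model.gguf" is a placeholder, not a real file.
llm_kwargs = {
    "model_path": "model.gguf",  # hypothetical local GGUF model file
    "n_gpu_layers": -1,          # -1 offloads all layers to the GPU (llama.cpp convention)
    "n_batch": 512,              # tokens processed in parallel per batch
    "f16_kv": True,              # half-precision KV cache; the Metal docs say this must be True
}

# With langchain-community installed, this would construct the LLM:
# from langchain_community.llms import LlamaCpp
# llm = LlamaCpp(**llm_kwargs)

print(llm_kwargs["n_gpu_layers"])
```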
Idea or request for content:
No response
🤖
It's great to see that you've identified an issue with the documentation and have taken the initiative to create a pull request to address it. Your detailed explanation and the example from your ipynb file are very helpful in understanding the problem. Your contribution will definitely improve the overall quality of the documentation. Keep up the good work!
@hwchase17 please also take a look.
Just saw the issue. I thought `n_gpu_layers` is a setting for how many layers of the model are offloaded to your GPU. In the document, even when set to 1, you enable offloading; how many layers to use, I would like to leave to the end user.
In the document we only provide suggestions; developers can then optimize by increasing the layer count.
I would suggest you build a page describing how to optimize the layers setting instead of just changing the basic instruction.
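Such a page could center on a small benchmark loop over candidate layer counts. A sketch (the factory callable and prompt are hypothetical; it assumes the model object can be called directly on a prompt string):

```python
import time
from typing import Callable, Dict, Iterable

def benchmark_layers(make_llm: Callable[[int], Callable[[str], str]],
                     prompt: str,
                     layer_values: Iterable[int]) -> Dict[int, float]:
    """Time one generation per candidate n_gpu_layers value and return
    {n_gpu_layers: seconds}, so users can pick the fastest setting."""
    timings = {}
    for n in layer_values:
        llm = make_llm(n)  # e.g. lambda n: LlamaCpp(..., n_gpu_layers=n)
        start = time.perf_counter()
        llm(prompt)
        timings[n] = time.perf_counter() - start
    return timings

# Usage with a dummy model standing in for a real LlamaCpp instance:
fake = lambda n: (lambda prompt: prompt.upper())
print(sorted(benchmark_layers(fake, "hello", [1, 16, 33])))
```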