
DOC: The description of `n_gpu_layers` in https://python.langchain.com/docs/integrations/llms/llamacpp#metal is incorrect

Open 169 opened this issue 5 months ago • 2 comments

Issue with current documentation:

In the Metal section of the https://python.langchain.com/docs/integrations/llms/llamacpp document, the description of `n_gpu_layers` claims that setting it to 1 is enough for Metal.
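For context, the Metal section configures `LlamaCpp` roughly like this (a paraphrased sketch rather than the exact docs snippet; the model path is a placeholder):

```python
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/path/to/model.gguf",  # placeholder path
    n_gpu_layers=1,  # the disputed advice: "Metal set to 1 is enough"
    n_batch=512,     # tokens processed per batch
    f16_kv=True,     # the docs require this on Metal
    verbose=True,
)
```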

I haven't found the rationale for this, and when I tested it locally, a larger value of `n_gpu_layers` significantly improved execution speed. A complete ipynb notebook is here: https://github.com/169/ai-snippets/blob/main/llama-cpp.ipynb

Here is why I came to this conclusion.

First, run inference with the `main` binary compiled from llama.cpp. You can see that by default, all 33 layers are offloaded to the GPU:

```
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 4095.07 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
```
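(For reference, that run is a plain llama.cpp invocation along the lines of `./main -m ./model.gguf -p "..." -n 128`, where the model path and prompt are placeholders and `-ngl` is left at its default, which offloads every layer here.)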

It’s also fast: about 33 tokens/s (from total time = 3760.69 ms / 124 tokens).

But if I use `n_gpu_layers=1`, only one layer is offloaded to the GPU and the rest stays on the CPU:

```
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/33 layers to GPU
llm_load_tensors: CPU buffer size = 4165.37 MiB
llm_load_tensors: Metal buffer size = 132.51 MiB
```

Much slower: about 18 tokens/s (from 2435.17 ms / 43 tokens).
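As a quick sanity check, here are the two throughput figures recomputed from the timing lines above:

```python
# Throughput derived from the llama.cpp timings quoted above.
full_offload = 124 / (3760.69 / 1000)  # 33/33 layers on GPU -> ~33.0 tokens/s
one_layer = 43 / (2435.17 / 1000)      # 1/33 layers on GPU  -> ~17.7 tokens/s
print(f"{full_offload:.1f} vs {one_layer:.1f} tokens/s")  # roughly a 1.9x gap
```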

Under the same conditions, setting `n_gpu_layers=33` has the same effect as using ./main:

```
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: Metal buffer size = 4095.07 MiB
```

The speed also rises back to about 31 tokens/s.
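For reference, the call behind that run looks something like this (a sketch; the exact code is in the notebook linked above, and the model path is a placeholder). Note that llama-cpp-python also accepts `n_gpu_layers=-1`, which offloads every layer without hard-coding the count:

```python
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/path/to/model.gguf",  # placeholder
    n_gpu_layers=33,  # offload all 33 layers; -1 means "offload everything"
    n_batch=512,
    f16_kv=True,
    verbose=True,
)
print(llm.invoke("Explain n_gpu_layers in one sentence."))
```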

So I believe that, within the valid range, the larger the value of `n_gpu_layers`, the faster the inference. There are also posts raising the same doubt: https://www.reddit.com/r/LangChain/comments/18lb4n4/llamacpp_on_mac_n_gpu_layers_n_batch/

I'm a bit confused by the current wording, so I opened a PR to remove this part of the description.

@genewoo I see you added this part; do you have any other context showing that `n_gpu_layers=1` is a best practice?

Idea or request for content:

No response

169 avatar Jan 24 '24 11:01 169

🤖

It's great to see that you've identified an issue with the documentation and have taken the initiative to create a pull request to address it. Your detailed explanation and the example from your ipynb file are very helpful in understanding the problem. Your contribution will definitely improve the overall quality of the documentation. Keep up the good work!



dosubot[bot] avatar Jan 24 '24 11:01 dosubot[bot]

@hwchase17 please also take a look.

169 avatar Jan 24 '24 13:01 169

Just saw the issue. As I understand it, `n_gpu_layers` sets how many layers of the model are offloaded to your GPU. In the document, even a value of 1 enables offloading; how many layers to offload, I would like to leave to the end user.

The documentation only provides a suggestion; developers can then optimize by increasing the number of layers.

I would suggest you build a page describing how to optimize the layers setting instead of just changing the basic instruction.

genewoo avatar Feb 22 '24 03:02 genewoo