Issue: LlamaCpp wrapper slows down the model
Issue you'd like to raise.
It looks like inference is a lot slower when using the LangChain LlamaCpp wrapper than when using the original llama-cpp-python Llama wrapper directly.
Here are the results for the same prompt on the RTX 4090 GPU.
When using llamacpp-python Llama wrapper directly:
When using langchain LlamaCpp wrapper:
As you can see, the prompt_eval stage takes roughly 13x longer (2.67 ms per token vs 35 ms per token).
Am I missing something? In both cases the model is fully loaded onto the GPU, and the same parameters are used. With the LangChain wrapper, no chain was involved; the model was queried directly through the wrapper's interface.
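For clarity, here is a minimal sketch of the two setups being compared. The model path, prompt, and n_gpu_layers value are placeholders rather than the exact settings from the notebook linked below.

```python
# Minimal comparison sketch -- paths and parameters are placeholders.
from time import perf_counter

from llama_cpp import Llama            # llama-cpp-python used directly
from langchain.llms import LlamaCpp    # LangChain wrapper

MODEL_PATH = "./models/model.bin"
PROMPT = "Explain the difference between a list and a tuple in Python."

# 1) llama-cpp-python Llama wrapper directly
raw_llm = Llama(model_path=MODEL_PATH, n_gpu_layers=40)
t0 = perf_counter()
raw_llm(PROMPT, max_tokens=256)
print(f"llama-cpp-python: {perf_counter() - t0:.2f} s")

# 2) LangChain LlamaCpp wrapper, same parameters
lc_llm = LlamaCpp(model_path=MODEL_PATH, n_gpu_layers=40, max_tokens=256)
t0 = perf_counter()
lc_llm(PROMPT)
print(f"LangChain LlamaCpp: {perf_counter() - t0:.2f} s")
```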
Link to the example notebook (the values are slightly different, but the problem is the same): https://github.com/mmagnesium/personal-assistant/blob/main/notebooks/langchain_vs_llamacpp.ipynb
Appreciate any help.
Suggestion:
Unfortunately, no suggestion, since I don't understand what the problem is.
The issue persists with new ggmlv3 quantized models. Tested using manticore-13B (https://huggingface.co/openaccess-ai-collective/manticore-13b).
However, evaluation time is now somewhat lower, since the new format is faster and more compact.
+1. Evaluation times for GPT4All are also very high compared to the UI that the Nomic team provides.
`ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), memory=memory)`
- This returns within a few seconds when using an OpenAI LLM, but is roughly 10x slower when using GPT4All with ggml-gpt4all-j-v1.3-groovy
- However, GPT4All itself is not slow: I get similar response times when using the library directly (not via LangChain)
Machine specs: Ryzen 5600X, 32 GB DDR4, GTX 1080 8 GB (irrelevant, the GPU is not used)
It looks like the issue was a misunderstanding of the n_batch parameter of the LlamaCpp wrapper.
The default value is 8, which is quite small if you want to utilize the GPU to its full potential. The default in the original llama-cpp-python Llama wrapper does not seem to be that low.
After setting this value to 512 or more, the issue was solved.
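For reference, a hedged sketch of the fix, assuming a local GGML model as above; the path and n_gpu_layers value are placeholders, and n_batch should be tuned to your hardware:

```python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/model.bin",  # placeholder path
    n_gpu_layers=40,   # offload layers to the GPU (adjust for your model/VRAM)
    n_batch=512,       # default is 8, which underutilizes the GPU during prompt eval
    n_ctx=2048,
    max_tokens=256,
)
print(llm("Explain the difference between a list and a tuple in Python."))
```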
@dhirajsuvarna, @intermag0 you might want to look for an analog of this parameter inside the GPT4All model wrapper.
The n_threads parameter also seems important if you are running on the CPU, as in the sketch below.
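If the GPT4All wrapper exposes analogous fields, the same idea would look roughly like this; the n_threads and n_batch names here are assumptions, so check the wrapper's signature in your LangChain version.

```python
# Rough sketch only -- field names are assumptions; verify them against
# the GPT4All wrapper's parameters in your LangChain version.
from langchain.llms import GPT4All

llm = GPT4All(
    model="./models/ggml-gpt4all-j-v1.3-groovy.bin",  # placeholder path
    n_threads=8,    # CPU threads, analogous to n_threads above
    n_batch=512,    # prompt batch size, if supported
)
```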
To make sure the same issue doesn't come up again, I updated the LlamaCpp integration demo notebook (see #5344).
The issue can now be closed.