Issue: LlamaCpp wrapper slows down the model
Issue you'd like to raise.
It looks like inference is a lot slower when using the LangChain LlamaCpp wrapper than when using the original llama-cpp-python Llama wrapper directly.
Here are the results for the same prompt on the RTX 4090 GPU.
When using llamacpp-python Llama wrapper directly:
When using langchain LlamaCpp wrapper:
As you can see, the prompt_eval stage takes roughly 13x longer (2.67 ms per token vs 35 ms per token).
Am I missing something? In both cases the model is fully loaded onto the GPU, and the same parameters are used. With the LangChain wrapper, no chain was involved; the model was queried directly through the wrapper's interface.
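For clarity, here is a minimal sketch of the two setups being compared. The model path, prompt, and n_gpu_layers value are placeholders rather than the exact settings from the notebook linked below.

```python
# Minimal comparison sketch -- paths and parameters are placeholders.
from time import perf_counter

from llama_cpp import Llama            # llama-cpp-python used directly
from langchain.llms import LlamaCpp    # LangChain wrapper

MODEL_PATH = "./models/model.bin"
PROMPT = "Explain the difference between a list and a tuple in Python."

# 1) llama-cpp-python Llama wrapper directly
raw_llm = Llama(model_path=MODEL_PATH, n_gpu_layers=40)
t0 = perf_counter()
raw_llm(PROMPT, max_tokens=256)
print(f"llama-cpp-python: {perf_counter() - t0:.2f} s")

# 2) LangChain LlamaCpp wrapper, same parameters
lc_llm = LlamaCpp(model_path=MODEL_PATH, n_gpu_layers=40, max_tokens=256)
t0 = perf_counter()
lc_llm(PROMPT)
print(f"LangChain LlamaCpp: {perf_counter() - t0:.2f} s")
```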
Link to the example notebook (the values are slightly different, but the problem is the same): https://github.com/mmagnesium/personal-assistant/blob/main/notebooks/langchain_vs_llamacpp.ipynb
Appreciate any help.
Suggestion:
Unfortunately, no suggestion, since I don't understand what the problem is.
The issue persists with new ggmlv3 quantized models. Tested using manticore-13B (https://huggingface.co/openaccess-ai-collective/manticore-13b).
However, evaluation time is now somewhat lower, since the new format is faster and more compact.
+1. Evaluation times for GPT4All are also very high compared to the UI that the Nomic team provides.
`ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), memory=memory)`
- This returns within a few seconds when using an OpenAI LLM, but is roughly 10x slower when using GPT4All with ggml-gpt4all-j-v1.3-groovy
- However, GPT4All itself is not slow: I get similar response times when using the library directly (not via LangChain)
Machine specs: Ryzen 5600X, 32 GB DDR4, GTX 1080 8 GB (irrelevant, the GPU is not used)
It looks like the issue was a misunderstanding of the n_batch parameter of the LlamaCpp wrapper.
The default value is 8, which is quite small if you want to utilize the GPU to its full potential. The default in the original llama-cpp-python Llama wrapper does not seem to be that low.
After setting this value to 512 or more, the issue was solved.
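For reference, a hedged sketch of the fix, assuming a local GGML model as above; the path and n_gpu_layers value are placeholders, and n_batch should be tuned to your hardware:

```python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/model.bin",  # placeholder path
    n_gpu_layers=40,   # offload layers to the GPU (adjust for your model/VRAM)
    n_batch=512,       # default is 8, which underutilizes the GPU during prompt eval
    n_ctx=2048,
    max_tokens=256,
)
print(llm("Explain the difference between a list and a tuple in Python."))
```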
@dhirajsuvarna, @intermag0 you might want to look for an analog of this parameter inside the GPT4All model wrapper.
The n_threads parameter also seems important if you are running on the CPU, as in the sketch below.
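If the GPT4All wrapper exposes analogous fields, the same idea would look roughly like this; the n_threads and n_batch names here are assumptions, so check the wrapper's signature in your LangChain version.

```python
# Rough sketch only -- field names are assumptions; verify them against
# the GPT4All wrapper's parameters in your LangChain version.
from langchain.llms import GPT4All

llm = GPT4All(
    model="./models/ggml-gpt4all-j-v1.3-groovy.bin",  # placeholder path
    n_threads=8,    # CPU threads, analogous to n_threads above
    n_batch=512,    # prompt batch size, if supported
)
```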
To make sure the same issue doesn't come up again, I updated the LlamaCpp integration demo notebook (see #5344).
The issue can now be closed.