Unable to run llama.cpp or GPT4All demos
I'm attempting to run both demos linked today but am running into issues. I've already migrated my GPT4All model.
When I run the llama.cpp demo all of my CPU cores are pegged at 100% for a minute or so and then it just exits without an error code or output.
When I run the GPT4All demo I get the following error:
Traceback (most recent call last):
File "/home/zetaphor/Code/langchain-demo/gpt4alldemo.py", line 12, in <module>
llm = GPT4All(model_path="models/gpt4all-lora-quantized-new.bin")
File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
File "pydantic/main.py", line 1102, in pydantic.main.validate_model
File "/home/zetaphor/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain/llms/gpt4all.py", line 132, in validate_environment
ggml_model=values["model"],
KeyError: 'model'
Try changing the model_path param to model.
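For example, something like this should work (a minimal sketch; the filename simply mirrors the path from the traceback above):
from langchain.llms import GPT4All

# pass the path via `model` rather than `model_path`
llm = GPT4All(model="models/gpt4all-lora-quantized-new.bin")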
I tried the following on Colab, but the last line never finishes... Does anyone have a clue?
from langchain.llms import GPT4All
from langchain import PromptTemplate, LLMChain
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm = GPT4All(model="{path_to_ggml}")
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)
Try changing the model_path param to model.
That got me a little further, now I'm getting the following output from the GPT4All model:
llama_model_load: loading model from 'models/gpt4all-lora-quantized-new.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from 'models/gpt4all-lora-quantized-new.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 512.00 MB
llama_generate: seed = 1680637863
system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
generate: n_ctx = 512, n_batch = 1, n_predict = 256, n_keep = 0
[end of text]
llama_print_timings: load time = 2298.94 ms
llama_print_timings: sample time = 78.52 ms / 150 runs ( 0.52 ms per run)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token)
llama_print_timings: eval time = 44807.91 ms / 181 runs ( 247.56 ms per run)
llama_print_timings: total time = 45558.50 ms
fish: Job 1, 'python gpt4alldemo.py' terminated by signal SIGSEGV (Address boundary error)
Hi @Zetaphor, are you referring to this Llama demo?
I'm the author of the llama-cpp-python library, and I'd be happy to help.
Can you give me an idea of what kind of processor you're running and the length of your prompt?
Because llama.cpp is running inference on the CPU, it can take a while to process the initial prompt, and there are still some performance issues with certain CPU architectures. To rule that out, can you try running the same prompt through the examples in llama.cpp?
Hey @abetlen,
I'm able to run the 7B model on both my laptop and my server without issues. Here are the specs for both:
- Laptop: Ryzen 9 5900HS, 40GB RAM
- Server: Xeon 6c/12t, 64GB RAM
I'm trying to run the prompt provided in the demo code:
What NFL team won the Super Bowl in the year Justin Bieber was born?
I have not yet run this exact prompt through llama.cpp, but I've been able to successfully run the chat-with-bob prompt on both my laptop and server. On the server, in addition to running the GPT4All model, I've also used the Vicuna 13B model.
On the suggestion of someone in Discord I'm now able to get output using the llama.cpp model; it looks like the fix here was to increase the token context to a much higher value. However, I'm definitely seeing reduced performance compared to what I get when running inference through llama.cpp directly.
import os
from langchain.memory import ConversationTokenBufferMemory
from langchain.agents.tools import Tool
from langchain.chat_models import ChatOpenAI
from langchain.llms.base import LLM
from langchain.llms.llamacpp import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.agents import load_tools, initialize_agent, AgentExecutor, BaseSingleActionAgent, AgentType

custom_llm = LlamaCpp(model_path="models/ggml-vicuna-13b-4bit.bin", verbose=True,
                      n_threads=4, n_ctx=5000, temperature=0.05, repeat_penalty=1.22)

tools = []
memories = {}
question = "What is the meaning of life?"

unique_id = os.urandom(16).hex()
if unique_id not in memories:
    memories[unique_id] = ConversationTokenBufferMemory(
        memory_key="chat_history", llm=custom_llm, return_messages=True)
memory = memories[unique_id]

agent = initialize_agent(tools, llm=custom_llm, agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
                         verbose=True, memory=memory, max_retries=1)
response = agent.run(input=question)
print(response)
@Zetaphor Correct, llama.cpp has set the default token context window at 512 for performance, which is also the default n_ctx value in langchain. You can set it at 2048 max, but this will slow down inference.
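As an illustration (not code from this thread), raising the context window when constructing the wrapper looks roughly like this:
from langchain.llms import LlamaCpp

# n_ctx defaults to 512; 2048 is the maximum mentioned above
llm = LlamaCpp(model_path="models/ggml-vicuna-13b-4bit.bin", n_ctx=2048)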
Thanks for the reply; based on that, I think it's related to this issue. I've opened a PR (#2411) to give you control over the batch size from LangChain (this got missed in the initial merge).
For now, you can also enable verbose logging and update the n_batch size with custom_llm.client.verbose = True and custom_llm.client.n_batch respectively. The verbose logs should also give you an idea of the per-token performance compared to llama.cpp, as it's using the same timing methods, so let me know if they look very off.
EDIT: You'll need to pip install --upgrade llama-cpp-python, as verbose was just added.
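For example, assuming custom_llm is the LlamaCpp instance from the snippet above (the value 8 for n_batch is just an illustrative choice):
# enable llama-cpp-python's per-token timing output
custom_llm.client.verbose = True
# adjust the prompt-processing batch size
custom_llm.client.n_batch = 8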
For me a simple question like this works:
print(llm_chain("tell me about Japan"))
but not llm_chain.run(question). I also tried something more complex and got a floating point exception:
@hershkoy I've not seen that floating point exception before, but this is using a different library. I suspect it might be a bug with n_ctx being too large, maybe? Just out of curiosity, can you try loading that gpt4all-converted.bin model in the LlamaCpp class? I'm not sure what the version compatibility is between our two libs, but if both fail in the same way it may be a bug in llama.cpp.
@Zetaphor Correct, llama.cpp has set the default token context window at 512 for performance, which is also the default n_ctx value in langchain. You can set it at 2048 max, but this will slow down inference.
Just FYI, the slowdown in performance is a bug. It's being investigated at ggerganov/llama.cpp#603. Inference should NOT slow down with increased context.
can you try loading that gpt4all-converted.bin model in the LlamaCpp class? I'm not sure what the version compatibility is between our two libs, but if both fail in the same way it may be a bug in llama.cpp.
I'm successfully running a chain loaded with the LlamaCpp class (e.g. in https://gist.github.com/psychemedia/51f45fbfe160f78605bdd0c1b404e499), but not with the GPT4All one.
(MacBook Pro mid-2015, Intel)
can you try loading that gpt4all-converted.bin model in the LlamaCpp class
@abetlen I am new to this. I can try. Can you explain how to do it?
@hershkoy Absolutely, all you have to do is change the following two lines.
First, update the llm import near the top of your file:
from langchain.llms import LlamaCpp
and then where you instantiate the class:
llm = LlamaCpp(model_path="gpt4all-converted.bin", n_ctx=2048)
Let me know if that still gives you an error on your system.
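Applied to the Colab snippet earlier in this thread, the modified demo would look roughly like this (a sketch; the model path is a placeholder):
from langchain import PromptTemplate, LLMChain
from langchain.llms import LlamaCpp  # swapped in for GPT4All

template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm = LlamaCpp(model_path="gpt4all-converted.bin", n_ctx=2048)
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
print(llm_chain.run(question))  # wrap in print() so the result is visible in the console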
First, it seems that I was missing a print around llm_chain.run(question). I don't remember where I took that code from. I added:
print(llm_chain.run(question))
and now I get an output.
When running the LlamaCpp class, there is output and the program quits with no error. When running the GPT4All class, there is output, but there is an error after the chain runs. Not sure how that is possible...
@hershkoy Upgrading my dependencies to langchain-0.0.132 and pyllamacpp-1.0.6 fixed the segfaults for me.
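For reference, those versions can be installed with something like the following (treat the exact pins as illustrative):
pip install --upgrade langchain==0.0.132 pyllamacpp==1.0.6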
I had the same problem. Upgrading my version of Pydantic fixed it:
pip install pydantic==1.10.7
Same problem. But I've never gotten any output, even when I used print(). Tried setting the context and everything; still couldn't find a solution.
Things I have tried:
- GPT4All basic prompt (no output, stuck forever)
- LangChain GPT4All prompt (no output, stuck forever)
- GPT4All Chat Mode (works without a problem)
Yeah, I had this same problem yesterday with the llama.cpp model. It was stuck and never produced any output, but I did get output when I used llama.cpp directly (i.e. not through langchain). Strangely enough, when I tried running the same code today with langchain, it worked just fine.
Leaving some traceback logs here from when I pressed Ctrl + C while stuck, in case it's any help.
^C^C^CTraceback (most recent call last):
File "/Users/vicevirus/Downloads/gpt4all/test.py", line 20, in <module>
File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/base.py", line 213, in run
return self(args[0])[self.output_keys[0]]
File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/base.py", line 116, in __call__
raise e
File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/base.py", line 113, in __call__
outputs = self._call(inputs)
File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/llm.py", line 57, in _call
return self.apply([inputs])[0]
File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/llm.py", line 118, in apply
response = self.generate(input_list)
File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/llm.py", line 62, in generate
return self.llm.generate_prompt(prompts, stop)
File "/opt/homebrew/lib/python3.10/site-packages/langchain/llms/base.py", line 107, in generate_prompt
return self.generate(prompt_strings, stop=stop)
File "/opt/homebrew/lib/python3.10/site-packages/langchain/llms/base.py", line 140, in generate
raise e
File "/opt/homebrew/lib/python3.10/site-packages/langchain/llms/base.py", line 137, in generate
output = self._generate(prompts, stop=stop)
File "/opt/homebrew/lib/python3.10/site-packages/langchain/llms/base.py", line 324, in _generate
text = self._call(prompt, stop=stop)
File "/opt/homebrew/lib/python3.10/site-packages/langchain/llms/gpt4all.py", line 177, in _call
text = self.client.generate(
File "/opt/homebrew/lib/python3.10/site-packages/pyllamacpp/model.py", line 112, in generate
pp.llama_generate(self._ctx, self.gpt_params, self._call_new_text_callback, verbose)
File "/opt/homebrew/lib/python3.10/site-packages/pyllamacpp/model.py", line 77, in _call_new_text_callback
def _call_new_text_callback(self, text) -> None:
KeyboardInterrupt
Specifically for llama.cpp, I think https://github.com/hwchase17/langchain/issues/2404#issuecomment-1497521897 points to the issue being in the CallbackManager.
It's likely that, due to the async nature of the callback manager, the "main" program exits before the chain returns.
To test this I put in a sleep loop, but it also seems that perhaps the callback manager isn't being used with run, or is faulty for this LLM wrapper.
This shows that streaming should be used (the streaming property is True by default): https://github.com/hwchase17/langchain/blob/master/langchain/llms/llamacpp.py#L217-L228
The CallbackManager should also be in play. It would be worth stepping through this with a debugger: https://github.com/hwchase17/langchain/blob/master/langchain/llms/llamacpp.py#L271-L273
You don't happen to have a minimal example at hand of LlamaCpp streaming the output word for word? I'm only getting the output streamed to the console, but I cannot write it into an object, and I don't understand how to access the stream.
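Not from this thread, but here is a rough sketch of one way to capture the streamed tokens into an object; the exact callback imports and constructor arguments differ between langchain versions, so treat these as assumptions:
from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

class TokenCollector(StreamingStdOutCallbackHandler):
    """Collects streamed tokens into a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def on_llm_new_token(self, token, **kwargs):
        # called once per generated token while streaming
        self.tokens.append(token)

collector = TokenCollector()
llm = LlamaCpp(model_path="gpt4all-converted.bin", n_ctx=2048,
               callback_manager=CallbackManager([collector]), verbose=True)
llm("Tell me about Japan")
print("".join(collector.tokens))  # the full generation, assembled word for word from the stream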
Hi, @Zetaphor! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you were experiencing issues running the llama.cpp and GPT4All demos. You mentioned that you tried changing the model_path parameter to model and made some progress with the GPT4All demo, but still encountered a segmentation fault. Other users suggested upgrading dependencies, changing the token context window, and using verbose logging. However, the issue remains unresolved.
Could you please let us know if this issue is still relevant to the latest version of the LangChain repository? If it is, please comment on this issue to let us know. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your understanding and cooperation!