Unable to run llama.cpp or GPT4All demos

Open Zetaphor opened this issue 1 year ago • 22 comments

I'm attempting to run both demos linked today but am running into issues. I've already migrated my GPT4All model.

When I run the llama.cpp demo all of my CPU cores are pegged at 100% for a minute or so and then it just exits without an error code or output.

When I run the GPT4All demo I get the following error:

Traceback (most recent call last):
  File "/home/zetaphor/Code/langchain-demo/gpt4alldemo.py", line 12, in <module>
    llm = GPT4All(model_path="models/gpt4all-lora-quantized-new.bin")
  File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic/main.py", line 1102, in pydantic.main.validate_model
  File "/home/zetaphor/.pyenv/versions/3.9.16/lib/python3.9/site-packages/langchain/llms/gpt4all.py", line 132, in validate_environment
    ggml_model=values["model"],
KeyError: 'model'

Zetaphor avatar Apr 04 '23 19:04 Zetaphor

Try changing the model_path param to model.

I tried the following on Colab, but the last line never finishes... Does anyone have a clue?

from langchain.llms import GPT4All
from langchain import PromptTemplate, LLMChain

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm = GPT4All(model="{path_to_ggml}")
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)

0xhiroki avatar Apr 04 '23 19:04 0xhiroki

Try changing the model_path param to model.

That got me a little further; now I'm getting the following output from the GPT4All model:

llama_model_load: loading model from 'models/gpt4all-lora-quantized-new.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from 'models/gpt4all-lora-quantized-new.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
llama_generate: seed = 1680637863

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
generate: n_ctx = 512, n_batch = 1, n_predict = 256, n_keep = 0


 [end of text]

llama_print_timings:        load time =  2298.94 ms
llama_print_timings:      sample time =    78.52 ms /   150 runs   (    0.52 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time = 44807.91 ms /   181 runs   (  247.56 ms per run)
llama_print_timings:       total time = 45558.50 ms
fish: Job 1, 'python gpt4alldemo.py' terminated by signal SIGSEGV (Address boundary error)

Zetaphor avatar Apr 04 '23 19:04 Zetaphor

Hi @Zetaphor are you referring to this Llama demo?

I'm the author of the llama-cpp-python library, I'd be happy to help.

Can you give me an idea of what kind of processor you're running and the length of your prompt?

Because llama.cpp runs inference on the CPU, it can take a while to process the initial prompt, and there are still some performance issues with certain CPU architectures. To rule that out, can you try running the same prompt through the examples in llama.cpp?

abetlen avatar Apr 04 '23 20:04 abetlen

Hey @abetlen,

I'm able to run the 7B model on both my laptop and my server without issues. Here are the specs for both:

  • Laptop: Ryzen 9 5900HS, 40GB RAM
  • Server: Xeon 6c/12t, 64GB RAM

I'm trying to run the prompt provided in the demo code:

What NFL team won the Super Bowl in the year Justin Bieber was born?

I have not yet run this exact prompt through llama.cpp, but I've been able to successfully run the "chat with Bob" prompt on both my laptop and server. On the server, in addition to the GPT4All model, I've also used the Vicuna 13B model.

Zetaphor avatar Apr 04 '23 20:04 Zetaphor

On the suggestion of someone in Discord I'm able to get output using the llama.cpp model; it looks like the fix here was to increase the token context to a much higher value. However, I am definitely seeing reduced performance compared to what I experience when running inference through llama.cpp directly.

import os
from langchain.memory import ConversationTokenBufferMemory
from langchain.agents.tools import Tool
from langchain.chat_models import ChatOpenAI
from langchain.llms.base import LLM
from langchain.llms.llamacpp import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.agents import load_tools, initialize_agent, AgentExecutor, BaseSingleActionAgent, AgentType


custom_llm = LlamaCpp(model_path="models/ggml-vicuna-13b-4bit.bin", verbose=True,
                      n_threads=4, n_ctx=5000, temperature=0.05, repeat_penalty=1.22)
tools = []

memories = {}

question = "What is the meaning of life?"
unique_id = os.urandom(16).hex()
if unique_id not in memories:
    memories[unique_id] = ConversationTokenBufferMemory(
        memory_key="chat_history", llm=custom_llm, return_messages=True)
memory = memories[unique_id]
agent = initialize_agent(tools, llm=custom_llm, agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
                         verbose=True, memory=memory, max_retries=1)

response = agent.run(input=question)

print(response)

Zetaphor avatar Apr 04 '23 20:04 Zetaphor

@Zetaphor Correct, llama.cpp has set the default token context window at 512 for performance, which is also the default n_ctx value in langchain. You can set it at 2048 max, but this will slow down inference.
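
For example, something along these lines (the model path is just a placeholder, and parameter support may vary slightly between versions):

from langchain.llms import LlamaCpp

# Raise the context window from the 512-token default to the 2048-token maximum.
# Expect slower inference with the larger context.
llm = LlamaCpp(model_path="models/ggml-vicuna-13b-4bit.bin", n_ctx=2048)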

rjadr avatar Apr 04 '23 21:04 rjadr

Thanks for the reply; based on that, I think it's related to this issue. I've opened a PR (#2411) to give you control over the batch size from LangChain (this got missed in the initial merge).

For now, you can also enable verbose logging with custom_llm.client.verbose = True and change the batch size via custom_llm.client.n_batch.
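
Roughly, that workaround looks like this (untested sketch; these are the client attributes mentioned above and may change as the library evolves):

# Enable llama.cpp-style timing output on the underlying llama-cpp-python client
custom_llm.client.verbose = True

# Increase the prompt-processing batch size (the logs above show n_batch = 1)
custom_llm.client.n_batch = 8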

The verbose logs should also give you an idea of the per token performance compared to llama.cpp as it's using the same timing methods so let me know if they look very off.

EDIT: You'll need to pip install --upgrade llama-cpp-python as verbose was just added.

abetlen avatar Apr 04 '23 21:04 abetlen

For me a simple question like this works:

print(llm_chain("tell me about Japan"))

but not the llm_chain.run(question)

I also tried something more complex and got a floating point exception:

[screenshot]

hershkoy avatar Apr 04 '23 22:04 hershkoy

More info: [screenshot]

hershkoy avatar Apr 04 '23 22:04 hershkoy

@hershkoy I've not seen that floating point exception before, but this is using a different library; I suspect it might be a bug with n_ctx being too large. Just out of curiosity, can you try loading that gpt4all-converted.bin model in the LlamaCpp class? I'm not sure what the version compatibility is between our two libs, but if both fail in the same way it may be a bug in llama.cpp.

abetlen avatar Apr 04 '23 22:04 abetlen

@Zetaphor Correct, llama.cpp has set the default token context window at 512 for performance, which is also the default n_ctx value in langchain. You can set it at 2048 max, but this will slow down inference.

Just FYI, the slowdown in performance is a bug. It's being investigated in ggerganov/llama.cpp#603. Inference should NOT slow down with increased context.

MillionthOdin16 avatar Apr 05 '23 04:04 MillionthOdin16

can you try loading that gpt4all-converted.bin model in the LlamaCpp class? I'm not sure what the version compatibility is between our two libs but if both fail in the same way it may be a bug in llama.cpp.

I'm successfully running a chain loaded with the LlamaCpp class (e.g. in https://gist.github.com/psychemedia/51f45fbfe160f78605bdd0c1b404e499) but not with the GPT4All one.

(MacBook Pro mid-2015, Intel)

psychemedia avatar Apr 05 '23 07:04 psychemedia

can you try loading that gpt4all-converted.bin model in the LlamaCpp class

@abetlen I am new to this. I can try. Can you explain how to do it?

hershkoy avatar Apr 05 '23 09:04 hershkoy

@hershkoy absolutely, all you have to do is change the following two lines

First update the llm import near the top of your file

from langchain.llms import LlamaCpp

and then where you instantiate the class

llm = LlamaCpp(model_path="gpt4all-converted.bin", n_ctx=2048)

Let me know if that still gives you an error on your system

abetlen avatar Apr 05 '23 09:04 abetlen

First, it seems that I was missing a print around llm_chain.run(question). I don't remember where I took that code from. I added:

print(llm_chain.run(question))

and now I get an output.

When running the LlamaCpp class there is output and the program quits with no error, whereas when running the GPT4All class there is output but an error after the chain runs. Not sure how that is possible...

GPT4ALL

[screenshot]

LLAMACPP

[screenshot]

hershkoy avatar Apr 05 '23 13:04 hershkoy

@hershkoy Upgrading my dependencies to langchain-0.0.132 and pyllamacpp-1.0.6 fixed the segfaults for me.

jhaan1979 avatar Apr 05 '23 18:04 jhaan1979

I had the same problem. Upgrading my version of Pydantic fixed it:

 pip install pydantic==1.10.7

benjamintanweihao avatar Apr 06 '23 15:04 benjamintanweihao

Same problem here, but I've never gotten any output even when using print(). I tried setting the context and everything, and still couldn't find a solution.

Things I have tried:

  • GPT4All basic prompt (no output, stuck forever)
  • LangChain GPT4All prompt (no output, stuck forever)
  • GPT4All Chat Mode (works without a problem)

[screenshot]

vicevirus avatar Apr 09 '23 14:04 vicevirus

Yeah, I had this same problem yesterday with the llama.cpp model. It was stuck and never produced any output, but I did get output when I used llama.cpp directly (i.e. not through LangChain). Strangely enough, when I tried running the same code today with LangChain, it worked just fine.

harshil21 avatar Apr 09 '23 14:04 harshil21

Leaving some traceback logs here from when I pressed Ctrl + C while stuck, if it's any help.

^C^C^CTraceback (most recent call last):
  File "/Users/vicevirus/Downloads/gpt4all/test.py", line 20, in <module>
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/base.py", line 213, in run
    return self(args[0])[self.output_keys[0]]
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/base.py", line 116, in __call__
    raise e
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/base.py", line 113, in __call__
    outputs = self._call(inputs)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/llm.py", line 57, in _call
    return self.apply([inputs])[0]
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/llm.py", line 118, in apply
    response = self.generate(input_list)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/chains/llm.py", line 62, in generate
    return self.llm.generate_prompt(prompts, stop)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/llms/base.py", line 107, in generate_prompt
    return self.generate(prompt_strings, stop=stop)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/llms/base.py", line 140, in generate
    raise e
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/llms/base.py", line 137, in generate
    output = self._generate(prompts, stop=stop)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/llms/base.py", line 324, in _generate
    text = self._call(prompt, stop=stop)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/llms/gpt4all.py", line 177, in _call
    text = self.client.generate(
  File "/opt/homebrew/lib/python3.10/site-packages/pyllamacpp/model.py", line 112, in generate
    pp.llama_generate(self._ctx, self.gpt_params, self._call_new_text_callback, verbose)
  File "/opt/homebrew/lib/python3.10/site-packages/pyllamacpp/model.py", line 77, in _call_new_text_callback
    def _call_new_text_callback(self, text) -> None:
KeyboardInterrupt

vicevirus avatar Apr 09 '23 14:04 vicevirus

Specifically for llama.cpp I think https://github.com/hwchase17/langchain/issues/2404#issuecomment-1497521897 points to the issue being in the Callback Manager.

It's likely that due to the async nature of callback manager the "main" program exits before the chain returns.

To test this I put a sleep loop, but it also seems that perhaps the callback manager isn't being used with run or is faulty for this LLM wrapper.


This shows that streaming should be used. (the streaming property is True by default) https://github.com/hwchase17/langchain/blob/master/langchain/llms/llamacpp.py#L217-L228

The CallbackManager should also be in play. It would be worth stepping through this with a debugger: https://github.com/hwchase17/langchain/blob/master/langchain/llms/llamacpp.py#L271-L273
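
For reference, a minimal sketch of wiring up streaming against the LlamaCpp wrapper in the LangChain versions discussed here (import paths may differ in other releases; the model path is a placeholder):

from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

# With streaming enabled, tokens are pushed to the callback handlers as they
# are generated instead of only being returned once the completion finishes.
llm = LlamaCpp(
    model_path="models/ggml-vicuna-13b-4bit.bin",
    n_ctx=2048,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
)
llm("Tell me about Japan")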

Freyert avatar Apr 29 '23 02:04 Freyert

Specifically for llama.cpp I think #2404 (comment) points to the issue being in the Callback Manager.

It's likely that due to the async nature of callback manager the "main" program exits before the chain returns.

To test this I put a sleep loop, but it also seems that perhaps the callback manager isn't being used with run or is faulty for this LLM wrapper.

This shows that streaming should be used. (the streaming property is True by default) https://github.com/hwchase17/langchain/blob/master/langchain/llms/llamacpp.py#L217-L228

The CallbackManager should also be in play. It would be worth stepping through this with a debugger: https://github.com/hwchase17/langchain/blob/master/langchain/llms/llamacpp.py#L271-L273

You don't happen to have a minimal example at hand of LlamaCpp streaming the output word for word? I am only getting the output streamed to the console, but cannot write it into an object, and I don't understand how to access the stream.
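
My guess is something like a custom handler that collects the streamed tokens into a list, along these lines (the TokenCollector class is just my sketch, not part of LangChain, and I haven't verified it against the current callback interface):

from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

class TokenCollector(StreamingStdOutCallbackHandler):
    """Accumulate streamed tokens instead of printing them."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Called once per generated token while streaming is enabled
        self.tokens.append(token)

collector = TokenCollector()
llm = LlamaCpp(
    model_path="models/ggml-vicuna-13b-4bit.bin",  # placeholder path
    n_ctx=2048,
    callback_manager=CallbackManager([collector]),
)
llm("Tell me about Japan")
answer = "".join(collector.tokens)  # the streamed output as a single string
print(answer)

...but I don't know if that is the intended way to access the stream.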

fabmeyer avatar May 03 '23 15:05 fabmeyer

Hi, @Zetaphor! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were experiencing issues running the llama.cpp and GPT4All demos. You mentioned that you tried changing the model_path parameter to model and made some progress with the GPT4All demo, but still encountered a segmentation fault. Other users suggested upgrading dependencies, changing the token context window, and using verbose logging. However, the issue remains unresolved.

Could you please let us know if this issue is still relevant to the latest version of the LangChain repository? If it is, please comment on this issue to let us know. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation!

dosubot[bot] avatar Sep 22 '23 16:09 dosubot[bot]