Llama.generate: prefix-match hit
The model runs well, although quite slowly, on a MacBook Pro M1 Max using the device mps. The first question about the document was answered well. However, after hitting enter on the second question, the message "Llama.generate: prefix-match hit" appears and the process freezes. Any thoughts?
@capecha I do see that on my M2 as well but I get a response after some time. Not sure what is causing it. Still looking into it.
I also get the message "Llama.generate: prefix-match hit" sometimes but it doesn't freeze the process for me, so I think this is not an error message. Do you get any other output? Maybe others can help if you post the complete output log.
Thank you both. I tried rerunning the process yesterday after @PromtEngineer's response, and there is a response; however, it took around 40 minutes. I am using constitution.pdf as a test before moving on to my own data. Once I realized it was working, I ran into a new problem, reported at https://github.com/PromtEngineer/localGPT/issues/217: a CSV file generates only one chunk. I tried the privateGPT ingest.py code, which generated 236 chunks from the exact same file I passed to ingest.py in localGPT. Thank you guys for all your support, and I hope you can address the issue with CSV files.
I am using "llama-2-7b-chat.ggmlv3.q2_K.bin" using "LlamaCpp()" in langchain. The process of "Llama.generate: prefix-match hit" repeats itself so many times. But I want answer only once. How can I set this to generate answer only once?
Facing the same issue. I'm using TheBloke/Llama-2-7B-Chat-GGML
and llama-2-7b-chat.ggmlv3.q4_0.bin
I am using "llama-2-7b-chat.ggmlv3.q2_K.bin" using "LlamaCpp()" in langchain. The process of "Llama.generate: prefix-match hit" repeats itself so many times. But I want answer only once. How can I set this to generate answer only once?
I'm facing the exact same problem with llama-2-13b-chat.ggmlv3.q4_0.bin and llama.cpp. The model's reply takes ages and it starts answering itself. It's driving me crazy.
I am using "llama-2-7b-chat.ggmlv3.q2_K.bin" using "LlamaCpp()" in langchain. The process of "Llama.generate: prefix-match hit" repeats itself so many times. But I want answer only once. How can I set this to generate answer only once?
I'm facing the exact same problem with llama-2-13b-chat.ggmlv3.q4_0.bin&llama.cpp. The model reply takes ages and it starts answering itself. It's driving me crazy.
Please let me know if you have already found any solution to this. I am still looking for one.
I'm also getting the following error whenever I try to make more than one query per session: Llama.generate: prefix-match hit
I'm running the following model: MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML" MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"
I also have CUDA available to Torch, but the program does not seem to want to utilize the GPU. I'm not sure if anyone else is experiencing this.
I saw the same problem, and the model starts talking to itself; no idea how to fix this. I'm using the model "llama-2-13b-chat.ggmlv3.q5_1.bin" from Hugging Face.
Same issue here, I'm also using: MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML" MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"
Basically, after running the chatbot locally it can only answer one question; if you ask a second question, "Llama.generate: prefix-match hit" shows up and the screen freezes.
If you are using langchain, this is the first reference I can find, though I don't think this is where I originally read it:
https://python.langchain.com/docs/modules/data_connection/retrievers/web_research
llama = LlamaCpp(
model_path="/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin",
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
n_ctx=4096, # Context window
max_tokens=1000, # Max tokens to generate
f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls
callback_manager=callback_manager,
verbose=True,
)
The important part here is perhaps this line:
MUST set to True, otherwise you will run into problem after a couple of calls
Which I am sure I have read elsewhere in relation to making more than one call.
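For what it's worth, here is a minimal two-call sketch (the model path and the questions are placeholders, not taken from this thread) that reproduces the message. As far as I understand, "Llama.generate: prefix-match hit" just means llama.cpp noticed that the new prompt shares a prefix with the tokens cached from the previous call and reused them, so on its own it is informational rather than an error:

from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="llama-2-7b-chat.ggmlv3.q4_0.bin",  # placeholder: point this at your local model file
    n_ctx=4096,
    f16_kv=True,   # keep this True, as per the comment above
    verbose=True,
)

# First call: the full prompt is evaluated, no prefix-match message.
print(llm("What does the first article of the constitution say?"))

# Second call in the same session: the shared prompt prefix is reused and
# "Llama.generate: prefix-match hit" is logged before generation continues.
print(llm("Summarize that in one sentence."))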
But in my case, f16_kv is already True by default, and I am still getting those self-regenerating responses from Llama.
No issues here; posting the terminal output so that we can see it. It all seems fine, though it did post a self-evaluation of sorts of the two assistants. But that seems to be the desired behaviour, is it not?
llama_new_context_with_model: kv self size = 2048.00 MB
llama_new_context_with_model: compute buffer total size = 293.88 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Enter a query: hello
llama_print_timings: load time = 50347.89 ms
llama_print_timings: sample time = 42.14 ms / 76 runs ( 0.55 ms per token, 1803.34 tokens per second)
llama_print_timings: prompt eval time = 104544.04 ms / 1034 tokens ( 101.11 ms per token, 9.89 tokens per second)
llama_print_timings: eval time = 14297.79 ms / 75 runs ( 190.64 ms per token, 5.25 tokens per second)
llama_print_timings: total time = 119013.77 ms
> Question:
hello
> Answer:
Hello! I'm here to help you with any questions or tasks you may have. To provide the best possible assistance, could you please clarify what you would like me to help you with? Are you looking for information on a specific topic, or do you need help with a task or problem? Please let me know and I will do my best to assist you.
Enter a query: what is DATASET
Llama.generate: prefix-match hit
llama_print_timings: load time = 50347.89 ms
llama_print_timings: sample time = 188.94 ms / 343 runs ( 0.55 ms per token, 1815.35 tokens per second)
llama_print_timings: prompt eval time = 82582.73 ms / 818 tokens ( 100.96 ms per token, 9.91 tokens per second)
llama_print_timings: eval time = 64143.04 ms / 342 runs ( 187.55 ms per token, 5.33 tokens per second)
llama_print_timings: total time = 147567.86 ms
> Question:
what is DATASET
> Answer:
Hello there! *excited face* Oh, wow! You want to know about datasets?! *nerd face* Well, let me tell you all about it! *mwahaha*
A dataset is a collection of things that are connected to each other. It's like a big box full of toys, and each toy represents one thing in the dataset. For example, if we were playing with toy cars, each car would be one thing in the dataset. *nodding*
But wait, there's more! A dataset can have different types of things inside it. It's like a toy box full of blocks, balls, and dolls. Each toy has its own special job, like the blocks are good at building things, the balls are good at bouncing, and the dolls are good at playing dress-up. *smiling*
Now, let me tell you about how we make a dataset. We take a bunch of things, like toys, and we put them together in one place. We call this place a dataset! *excited face* And then, we can use all these things inside the dataset to learn new things and make more things! It's like playing with blocks and building a big castle! *grinning*
So, that's what a dataset is! It's like a big box full of toys that help us learn and play. And we can use different types of things inside the dataset to do lots of cool things! *nodding* Do you want to make a dataset with me? *smiling*
Enter a query: Llama.generate: prefix-match hit
llama_print_timings: load time = 50347.89 ms
llama_print_timings: sample time = 126.42 ms / 226 runs ( 0.56 ms per token, 1787.71 tokens per second)
llama_print_timings: prompt eval time = 89522.82 ms / 874 tokens ( 102.43 ms per token, 9.76 tokens per second)
llama_print_timings: eval time = 41830.79 ms / 225 runs ( 185.91 ms per token, 5.38 tokens per second)
llama_print_timings: total time = 131882.81 ms
> Question:
> Answer:
Based on the provided context, I would rate Assistant 1's answer as a 3 out of 10 in terms of helpfulness, relevance, accuracy, and level of details. Assistant 1's answer does not provide any relevant information or insights related to the user's question, and it does not address any specific aspects of the topic. The answer is also not very detailed, lacking any specific examples or explanations to support the general statement made.
On the other hand, Assistant 2's answer scores a 9 out of 10 in terms of helpfulness, relevance, accuracy, and level of details. Assistant 2 provides a comprehensive and informative response that addresses the user's question directly and provides specific examples and explanations to support the answer. The response is also highly relevant to the topic and shows a good understanding of the context.
Overall, based on the analysis of both responses, it can be concluded that Assistant 2 provided a much more helpful and informative answer than Assistant 1.
Enter a query:
I am using "llama-2-7b-chat.ggmlv3.q2_K.bin" using "LlamaCpp()" in langchain. The process of "Llama.generate: prefix-match hit" repeats itself so many times. But I want answer only once. How can I set this to generate answer only once?
I'm facing the exact same problem with llama-2-13b-chat.ggmlv3.q4_0.bin&llama.cpp. The model reply takes ages and it starts answering itself. It's driving me crazy.
Please let me know if you got already any solution regarding this. I am still looking for it.
If you're still facing this issue: I've found that the Llama 2 model is trained to recognize certain "stop words" in the prompt, so in order to make it stop answering itself I've used this LlamaCpp initialization, where the 'stop' argument lists those stop words:
llm = LlamaCpp(
    model_path='/models/llama-2-13b.Q5_K_M.gguf',
    n_ctx=2048,
    n_gpu_layers=40,
    n_batch=512,
    temperature=0.4,
    stop=['### Human:', '### Assistant:'],
)
and the following prompt:
template = """Assistant is a large language model. Assistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand. Assistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics. Overall, Assistant is a powerful tool that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether you need help with a specific question or just want to have a conversation about a particular topic, Assistant is here to assist.
### Human: {human_input}
### Assistant answer:"""
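For completeness, here is a minimal sketch of how the stop list and this template might be wired together end to end. The model path and the example question are illustrative only, and chain.invoke assumes a recent LangChain version (older releases would use chain.run instead):

from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Stop generation as soon as the model tries to open a new "### Human:" or
# "### Assistant:" turn, so it answers exactly once instead of talking to itself.
llm = LlamaCpp(
    model_path='/models/llama-2-13b.Q5_K_M.gguf',  # adjust to your local model file
    n_ctx=2048,
    temperature=0.4,
    stop=['### Human:', '### Assistant:'],
)

prompt = PromptTemplate.from_template(template)  # the template string defined above
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.invoke({"human_input": "What is a dataset?"}))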
Let me know if you have any doubt about the code!
I am using "llama-2-7b-chat.ggmlv3.q2_K.bin" using "LlamaCpp()" in langchain. The process of "Llama.generate: prefix-match hit" repeats itself so many times. But I want answer only once. How can I set this to generate answer only once?
Did anyone resolve this issue? My situation is that the model does not return anything other than "prefix-match hit" or a long run of "#" characters. It is strange that it used to work, and I did not change anything. I am running this on Colab.
I am using the llama-2-7b-chat.Q5_0.gguf model from TheBloke/Llama-2-7B-Chat-GGUF (https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF) with langchain. The llm setup is as below:
from langchain.llms import LlamaCpp
llm = LlamaCpp(
model_path=llama_model_path,
temperature=0.1,
top_p=1,
n_ctx=16000,
n_gpu_layers=n_gpu_layers,
n_batch=n_batch,
callback_manager=callback_manager,
verbose=True,
)
The prompt template is:
Testing_message = "The Stoxx Europe 600 index slipped 0.5% at the close, extending a lackluster start to the year."
# a more flexible way to ask Llama general questions using LangChain's PromptTemplate and LLMChain
%%time
prompt = PromptTemplate.from_template(
"Extract the named entity information from below text: {text}"
)
chain = LLMChain(llm=llm, prompt=prompt)
answer = chain.invoke(Testing_message)
Today I faced the same situation. What works for me is installing llama-cpp-python==0.2.28. I saw that they have released a new version, and for some reason it does not work.
Edited as I have just seen the comment about 0.2.29.
You can revert with the following and I am about to try it out.
pip install llama-cpp-python==0.2.28 --force-reinstall --no-cache-dir
Can confirm that this works, so definitely a problem with 0.2.29.
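To double-check which version actually ended up installed after the reinstall, a quick check like this should do (assuming the package imports cleanly):

import llama_cpp
print(llama_cpp.__version__)  # should report 0.2.28 after the downgrade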
Interesting, I was getting the same.
I did note that if you take out n_gpu_layers and use the CPU, you actually get an answer:
- Stoxx Europe 600 index: organization
- slipped: verb
- 0.5%: number
- lackluster: adjective
- start: noun
- year: noun
But I can't get it running properly with the GPU, even after stripping it right down to this:
from langchain_community.llms import LlamaCpp
llama_model_path = r"\models\mistral-7b-instruct-v0.1.Q6_K.gguf"
llm = LlamaCpp(
model_path=llama_model_path,
verbose=True,
)
prompt = r"Extract the named entity information from text: 'The Stoxx Europe 600 index slipped 0.5% at the close, extending a lackluster start to the year.'"
result = llm(f"<s>[INST]{prompt}[/INST]")
print(result)
I also tried bypassing LangChain entirely and couldn't get the GPU working with Llama either. This is on my home setup; I have not noticed it on work machines, where it seems to be working fine. I figure it is likely a setup issue, but I'm not sure how to narrow that down. Still, I thought it would be worthwhile to confirm I'm seeing the same thing.
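For comparison, the GPU-offload variant I would expect to work is the same call with n_gpu_layers added. This is only a sketch and assumes llama-cpp-python was built with GPU support (for example, installed with the cuBLAS flag shown further down in this thread):

from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path=r"\models\mistral-7b-instruct-v0.1.Q6_K.gguf",  # same model as above
    n_gpu_layers=-1,  # offload all layers; use a smaller number if VRAM is limited
    n_batch=512,
    verbose=True,     # the startup log should show how many layers were offloaded to the GPU
)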
llama-cpp-python==0.2.28 works properly. I am indeed using the Colab GPU, so my install command is:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python==0.2.28 --no-cache-dir