How to run inference with the converted GGUF using llama-cpp?
I would appreciate it if anyone could help with the following problem when using the converted GGUF for inference.
I found that inference with llama-cpp generates different results from inference with the saved LoRA adapters. In both cases I am using a Q4-quantized model.
For inference with the LoRA adapters, I kept the alpaca_prompt format:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model",  # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
    [
        alpaca_prompt.format(
            instruct,     # instruction
            description,  # input
            "",           # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 8000, use_cache = True, temperature = 0)
tokenizer.batch_decode(outputs)
For inference with llama-cpp, I used its chat completion API, since I didn't find a way to retain the alpaca_prompt format:
llm = Llama(
    model_path=SAVED_PATH,
    n_gpu_layers=-1,  # use GPU acceleration
    seed=1,           # set a specific seed
    n_ctx=2048,       # context window size
    # tokenizer=LlamaHFTokenizer.from_pretrained(SAVED_PATH)  # is this necessary???
)
...
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": instruct},
        {"role": "user", "content": description},
    ],
    temperature=0,
    max_tokens=8000,
)
Is it necessary to retain the alpaca_prompt format, or to convert the tokenizer from Unsloth to llama-cpp?
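This is roughly what I had in mind for keeping the alpaca_prompt with llama-cpp-python's raw completion API (an untested sketch; llm, alpaca_prompt, instruct and description are the same variables as above):

# Untested sketch: pass the alpaca-formatted string to the raw completion API
# instead of create_chat_completion, so no chat template is applied on top of it.
prompt = alpaca_prompt.format(
    instruct,     # instruction
    description,  # input
    "",           # output - leave this blank for generation!
)
output = llm.create_completion(
    prompt,
    max_tokens=8000,
    temperature=0,
)
print(output["choices"][0]["text"])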
In the llama-cpp-python README (https://github.com/abetlen/llama-cpp-python), it is mentioned that: "Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is required to provide HF Tokenizer for functionary. The LlamaHFTokenizer class can be initialized and passed into the Llama class. This will override the default llama.cpp tokenizer used in Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files." I don't quite understand whether such a discrepancy exists in my case, since the Unsloth demo notebook doesn't seem to mention it.
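If it does, is it just a matter of passing the tokenizer override like this (also untested; "lora_model" is a placeholder for whatever repo or directory holds the HF tokenizer files)?

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

# Untested: override llama.cpp's built-in tokenizer with the HF one,
# as described in the llama-cpp-python README.
llm = Llama(
    model_path=SAVED_PATH,
    tokenizer=LlamaHFTokenizer.from_pretrained("lora_model"),  # placeholder repo/path
    n_gpu_layers=-1,
    n_ctx=2048,
)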
Thanks!
A good idea to use llama-cpp's Python module - I'll make an example
I know I'm late here, but I'll leave this in case anyone finds it useful :D. Here's what I'd do to use llama.cpp (and what I actually do with vLLM, to be honest) for custom models:
What I'd do
You can use the /completions endpoint of the llama.cpp server, so after compiling you can send it whatever prompt format you want. This means you only need the openai module in Python. I'll show this with a llama3-Instruct GGUF for the example's purposes.
Launch llama.cpp server
In this case only minimal configuration is needed:
will@fedora:~/libs/llama.cpp/build/bin$ ./llama-server -m ~/llm_models/Meta-Llama-3-8B-Instruct.Q8_0.gguf -t 8 --host 127.0.0.1 --port 8080
Example code to use /completions endpoint
from openai import OpenAI

OPENAPI_CONF = {
    "base_url": "http://127.0.0.1:8080/v1",
    "api_key": "abc123",
}

def base_model_generate(OPENAPI_CONF, model, prompt, max_tokens, stop_list):
    """
    Function for base model generation from a raw prompt. Can be used with
    fine-tuned models by applying their corresponding templates.

    Arguments:
        OPENAPI_CONF (dict) : Dictionary with the URL and API key for our OpenAI-like endpoint
        model (str)         : Model name
        prompt (str)        : Prompt with the model's template already applied
        max_tokens (int)    : Maximum tokens generated by the model. Prompt tokens are taken into account.
        stop_list (list)    : List of strings that force generation to stop.

    Returns:
        Yields the generated text chunk by chunk.
    """
    client = OpenAI(
        base_url=OPENAPI_CONF["base_url"],
        api_key=OPENAPI_CONF["api_key"],
    )
    response = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        stream=True,
        stop=stop_list,
    )
    for chunk in response:
        chunk_content = str(chunk.content)
        yield chunk_content

#### IMPORTANT: Here you'd apply any prompt format you want for YOUR specific model.
#### In this case, since it's llama3-instruct, I use its chat template.
prompt = """<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

for result in base_model_generate(
    OPENAPI_CONF=OPENAPI_CONF,
    model="Meta-Llama-3-8B-Instruct.Q8_0.gguf",
    prompt=prompt,
    max_tokens=2048,
    stop_list=["<|eot_id|>"],
):
    print(result, end="")
After running this code (obviously with the llama.cpp server launched) I get the following result:
will@fedora:~/Projects$ python3 test.py
I am a helpful assistant! I'm an AI designed to assist and communicate with humans in a helpful and informative way. I can provide information on a wide range of topics, answer questions, and even help with tasks and problems. My goal is to be a useful tool for you, whether you need assistance with something specific, or just want to chat and pass the time. I'm here to help, so feel free to ask me anything!
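If you wanted to adapt this to the alpaca-style prompt from the question above, it could look roughly like this (a sketch, assuming the standard Alpaca template from the Unsloth notebook; the model file name, stop strings, and the instruct/description variables are placeholders you'd swap for your own fine-tune):

# Sketch: Alpaca-style prompt sent through the same base_model_generate helper.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
"""

prompt = alpaca_prompt.format(instruct, description)

for result in base_model_generate(
    OPENAPI_CONF=OPENAPI_CONF,
    model="your-finetune.Q4_K_M.gguf",  # placeholder gguf name
    prompt=prompt,
    max_tokens=2048,
    stop_list=["### Instruction:", "<|end_of_text|>"],  # adjust to your model's EOS
):
    print(result, end="")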
Hope it helps someone!