
How to run inference with the converted GGUF using llama-cpp?


I would appreciate it if anyone could help with the following problem when using the converted GGUF for inference.

I found that running inference with llama-cpp produces a different result from running inference with the saved LoRA adapters. In both cases I am using the Q4-quantized model.

For inference with the LoRA adapters, I kept the alpaca_prompt format:

if True: # set to True (the notebook default is False) so the saved LoRA adapters are actually loaded
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        instruct, # instruction
        description, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 8000, use_cache = True, temperature = 0)
tokenizer.batch_decode(outputs)

For inference with llama-cpp, I used its chat completion API, since I didn't find a way to retain the alpaca_prompt format:

from llama_cpp import Llama

llm = Llama(
      model_path=SAVED_PATH,
      n_gpu_layers=-1, # offload all layers to the GPU
      seed=1, # set a specific seed for reproducible sampling
      n_ctx=2048, # context window size
      # tokenizer=LlamaHFTokenizer.from_pretrained(SAVED_PATH) # is this necessary???
)
...
output = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": instruct},
          {"role": "user","content": description}
      ],
      temperature=0,
      max_tokens=8000
)

Is it necessary to retain the alpaca_prompt format or to convert the tokenizers from unsloth to llama-cpp?
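
For what it's worth, llama-cpp-python also has a plain completion API (calling the Llama object directly), so I assume something like the following sketch would keep the training-time prompt format (it reuses the same SAVED_PATH, alpaca_prompt, instruct, and description as above; the stop token is a guess and would need to match the model):

from llama_cpp import Llama

llm = Llama(model_path=SAVED_PATH, n_gpu_layers=-1, seed=1, n_ctx=2048)

# Reuse the exact prompt format used during training instead of the chat template
prompt = alpaca_prompt.format(instruct, description, "")

output = llm(
    prompt,                    # plain text completion, no chat template applied
    max_tokens=2048,
    temperature=0,
    stop=["<|end_of_text|>"],  # assumption: replace with the model's actual EOS/stop token
)
print(output["choices"][0]["text"])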

In the llama-cpp-python README (https://github.com/abetlen/llama-cpp-python), it is mentioned that: "Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is required to provide HF Tokenizer for functionary. The LlamaHFTokenizer class can be initialized and passed into the Llama class. This will override the default llama.cpp tokenizer used in Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files." I don't quite understand whether such a discrepancy exists here, since the Unsloth demo notebook doesn't seem to mention it.
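
Based on that description, I assume the tokenizer override would be wired up roughly like this (untested sketch; it presumes the original HF tokenizer files sit alongside the GGUF at SAVED_PATH):

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

llm = Llama(
      model_path=SAVED_PATH,
      tokenizer=LlamaHFTokenizer.from_pretrained(SAVED_PATH), # overrides the default llama.cpp tokenizer
      n_gpu_layers=-1,
      n_ctx=2048,
)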

Thanks!

mk0223 avatar May 17 '24 08:05 mk0223

Good idea to use llama-cpp's Python module - I'll make an example

danielhanchen avatar May 17 '24 18:05 danielhanchen

I know I'm late, but I'll leave this here in case anyone finds it useful :D. Here's what I'd do to use llama.cpp (and what I actually do with vllm, tbh) for custom models:

What I'd do

You can use the /completions endpoint of the llama.cpp server, so after compiling it you can send whatever prompt format you want. This means you only need the openai module in Python. I'll use a llama3-Instruct GGUF for this example.

Launch llama.cpp server

In this case we use only minimal configuration:

will@fedora:~/libs/llama.cpp/build/bin$ ./llama-server -m ~/llm_models/Meta-Llama-3-8B-Instruct.Q8_0.gguf -t 8 --host 127.0.0.1 --port 8080
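
If you want to double-check that the server is up before running the client code, here's a quick sketch that hits the server's /health endpoint (same host and port as above):

import urllib.request

# Ask the llama.cpp server whether it is ready to serve requests
with urllib.request.urlopen("http://127.0.0.1:8080/health") as resp:
    print(resp.status, resp.read().decode())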

Example code to use /completions endpoint

from openai import OpenAI

OPENAPI_CONF = {
    "base_url" : "http://127.0.0.1:8080/v1",
    "api_key" : "abc123"
}

def base_model_generate(OPENAPI_CONF, model, prompt, max_tokens, stop_list):
    """
    Function for base model generation based on prompt. Can be used with fine tuned models applying their corresponding templates
    
    Arguments:
        OPENAPI_CONF (dict) : Dictionary with url and API key for our OPENAI like endpoint
        model (str)         : Model name
        prompt (str)       : Prompt with applied model template
        max_tokens (int) : Maximum tokens generated by the model. Prompt tokens are taken into account
        stop_list (list)     : List of strings that define the strings that will force generation to stop.

    Returns:
        yields characters generated
    """
    client = OpenAI(
        base_url=OPENAPI_CONF['base_url'],
        api_key=OPENAPI_CONF['api_key']
    )

    response = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        stream=True,
        stop=stop_list
    )

    for chunk in response:
        # Each streamed chunk is a Completion object; the generated text is in choices[0].text
        chunk_content = chunk.choices[0].text
        yield chunk_content

#### IMPORTANT: Here you'd apply any prompt format you'd want for YOUR specific model. In this case, since it's llama3 instruct I use the chat template
prompt = """<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

for result in base_model_generate(OPENAPI_CONF=OPENAPI_CONF, model="Meta-Llama-3-8B-Instruct.Q8_0.gguf", prompt=prompt, max_tokens=2048, stop_list=["<|eot_id|>"]):
    print(result, end="")

After running this code (with the llama.cpp server launched, obviously) I get the following result:

will@fedora:~/Projects$ python3 test.py 
I am a helpful assistant! I'm an AI designed to assist and communicate with humans in a helpful and informative way. I can provide information on a wide range of topics, answer questions, and even help with tasks and problems. My goal is to be a useful tool for you, whether you need assistance with something specific, or just want to chat and pass the time. I'm here to help, so feel free to ask me anything!

Hope it helps someone!

Wiill007 avatar Nov 20 '24 19:11 Wiill007