codellama
Instruct model's performance becomes poor when switching to a different format
I use the 7B- and 13B-Instruct models for a program analysis task. When I use the original-format 7B-Instruct and 13B-Instruct models, following example_chat_completion.py, everything is fine.
But when I use the Hugging Face version of the model with the generate API, the model's responses are much worse than with the original format. The same thing happens when I deploy it online with a tool like lmdeploy.
It seems only the locally run original-format model gives me a satisfactory answer. How can I fix this?
In the Hugging Face generation pipeline, are you using the Instruct prompt format?
<s>[INST] user message 1 [/INST] response 1 </s><s>[INST] user message 2 [/INST] response 2 </s>
In example_chat_completion.py this is handled internally, while with Hugging Face we have to do it manually.
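For reference, here is a minimal sketch of building that format yourself for a single user turn. The model id is only an example, and the apply_chat_template call assumes a transformers version recent enough to ship chat templates; the manual string is the fallback.

from transformers import AutoTokenizer

# Example model id; substitute whichever Instruct checkpoint you are using.
model_id = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

user_message = "your program analysis prompt here"

# Manual construction of the template above for a single turn.
# Note: if the tokenizer adds BOS itself during encoding, drop the
# leading <s> to avoid a duplicated BOS token.
prompt = f"<s>[INST] {user_message.strip()} [/INST]"

# Alternatively, recent transformers versions can render the same template
# from a role/content dialog:
messages = [{"role": "user", "content": user_message}]
templated = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
print(templated)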
Yes, and I have also tried different prompting methods. For example, I tried following example_text_completion.py; while the responses are not as good as with chat completion, the answers are still somewhat reliable. But for the Hugging Face model, the answers are always worse.
Can you share code for inference?
The prompt is like:
Pay special attention to semantic similarity between the xx,xx,xx
...
...
Your analysis should determine if there is a substantial possibility of the indirect call effectively invoking the function.
Provide your answer with only 'yes' (for likely), or 'no' (for unlikely).
The code for inference is the same as example_chat_completion.py for the Llama format. For the Hugging Face format, I transform the prompt into a string and use the same code as in the Hugging Face example.
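Roughly, the original-format side is just the stock example from this repo; the sketch below is approximately what I run, with placeholder paths and sampling settings rather than my exact values:

from llama import Llama

generator = Llama.build(
    ckpt_dir="CodeLlama-7b-Instruct/",  # placeholder path
    tokenizer_path="CodeLlama-7b-Instruct/tokenizer.model",
    max_seq_len=1024,
    max_batch_size=1,
)

dialogs = [
    [{"role": "user", "content": "Pay special attention to semantic similarity between the xx,xx,xx ... Provide your answer with only 'yes' (for likely), or 'no' (for unlikely)."}],
]

results = generator.chat_completion(
    dialogs,
    max_gen_len=None,
    temperature=0.2,  # placeholder sampling settings
    top_p=0.95,
)
for result in results:
    print(result["generation"]["content"])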
@for-just-we can you post some code for inference with a relevant example prompt for both this repo and the HF model so we can see where things might go wrong?
@for-just-we It would be helpful if you posted your inference code here. In the meantime, could you try this code snippet and check whether it produces better results?
from transformers import AutoTokenizer
import transformers
import torch

model = "YOUR MODEL NAME"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = """<s>[INST] Pay special attention to semantic similarity between the xx,xx,xx\n\nYour analysis should determine if there is a substantial possibility of the indirect call effectively invoking the function.
Provide your answer with only 'yes' (for likely), or 'no' (for unlikely). [/INST]"""

sequences = pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    temperature=0.1,
    top_p=0.95,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
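If you want to compare against the generate API directly (which you mentioned you were using), a rough equivalent of the pipeline call above is sketched below; the model name and sampling settings are the same placeholders, and max_new_tokens replaces max_length so the prompt length does not eat into the generation budget.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "YOUR MODEL NAME"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# The tokenizer adds <s> itself, so the literal BOS is omitted here.
prompt = "[INST] your instruction here [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        top_k=10,
        temperature=0.1,
        top_p=0.95,
        max_new_tokens=200,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))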