AutoAWQ
[Performance degradation] phi-3-medium-128k-instruct produces repetitive output after AWQ quantization
I quantized phi-3-medium-128k-instruct with AutoAWQ using this quant config:

```python
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
```

Nothing else was changed in the quantize.py file.
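For completeness, the quantization side is just the stock AutoAWQ example flow with my config plugged in; a rough sketch (the HF repo id `microsoft/Phi-3-medium-128k-instruct` and the output path are my assumptions about the run):

```python
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

def quantize_phi3(model_path="microsoft/Phi-3-medium-128k-instruct",
                  quant_path="./phi-3-128k-medium-autoawq"):
    # Heavy imports kept inside the function so the sketch reads standalone.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # AWQ calibration + 4-bit GEMM weight packing
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```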
Then I run generator.py as follows:
```python
from awq import AutoAWQForCausalLM
from awq.utils.utils import get_best_device
from transformers import AutoTokenizer, TextStreamer
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--quant_path', type=str, help='The quantized model path')
parser.add_argument('--prompt', type=str, help='Prompt for generation')
args = parser.parse_args()
quant_path = args.quant_path

# Load model
if get_best_device() == "cpu":
    model = AutoAWQForCausalLM.from_quantized(quant_path, use_qbits=True, fuse_layers=False)
else:
    model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=False)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# prompt = "You're standing on the surface of the Earth. " \
#          "You walk one mile south, one mile west and one mile north. " \
#          "You end up exactly where you started. Where are you?"
prompt = args.prompt

chat = [
    {"role": "system", "content": "You are a concise assistant that helps answer questions."},
    {"role": "user", "content": prompt},
]

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|endoftext|>"),
    tokenizer.convert_tokens_to_ids("<|end|>"),
    tokenizer.convert_tokens_to_ids("<|assistant|>"),
]

tokens = tokenizer.apply_chat_template(
    chat,
    return_tensors="pt",
)
tokens = tokens.to(get_best_device())

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=1024,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.2,
)
print(generation_output)
```
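For context on the sampling setup: my understanding is that `repetition_penalty=1.2` corresponds to Hugging Face's RepetitionPenaltyLogitsProcessor, which rescales the logits of tokens already present in the context. A simplified plain-Python sketch (the real implementation operates on tensors):

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Rescale logits of already-seen tokens (HF-style repetition penalty).

    Positive logits are divided by the penalty, negative ones multiplied,
    so every previously generated token becomes less likely.
    """
    out = list(logits)
    for tid in set(seen_token_ids):
        if out[tid] > 0:
            out[tid] /= penalty
        else:
            out[tid] *= penalty
    return out

# A seen token with a positive logit is pushed down; a negative one pushed further down.
print(apply_repetition_penalty([2.0, -1.0, 0.5], seen_token_ids=[0, 1], penalty=2.0))
# → [1.0, -2.0, 0.5]
```

So 1.2 is a fairly mild penalty; it evidently isn't enough to stop the loop here.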
I set fuse_layers=False because otherwise the model can't be loaded on the GPU (A100 40GB).
When I run python3 quan-phi3-inference2.py --quant_path ./phi-3-128k-medium-autoawq --prompt "tell me some advice when I workout", the output becomes repetitive.
Any tips for this issue?