AutoAWQ
Converting a Llama 3.1 model finetuned with LoRA into AWQ
I have finetuned Llama 3.1 using Unsloth. Then I merged and unloaded the LoRA adapters and pushed the merged model to the Hub.
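The merge step was roughly the following (a sketch; the base model and repo names are placeholders, not the exact ones I used):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model and apply the LoRA adapters on top of it
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base, "your-username/llama31-lora-adapters")

# Fold the LoRA weights into the base weights and drop the adapter wrappers
merged = model.merge_and_unload()
merged.push_to_hub("your-username/llama31-merged")
```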
Now, when I tried quantizing it with:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer  # needed for the tokenizer below

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Load the merged model (model_path and access_token are defined earlier)
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False, token=access_token
)
tokenizer = AutoTokenizer.from_pretrained(model_path, token=access_token)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)
```
it fails with:

```
RuntimeError: output with shape [8388608, 1] doesn't match the broadcast shape [8388608, 4096]
```
I am not sure what the issue is. Can anyone please guide me?
If you have a normal FP16/BF16 model, this does not happen. As a first step, I would suggest checking whether the model can run inference with the Hugging Face libraries.
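Such a sanity check could be as simple as this (a minimal sketch; `model_path` is a placeholder for your merged Hub repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your-username/llama31-merged"  # placeholder for the merged repo

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")

# Generate a few tokens; if this fails, the merged checkpoint itself is broken
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```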
@casper-hansen
Yeah, I am able to run inference with the Hugging Face model, as can be seen in the screenshot.
I am not sure what the issue is with converting it to the AWQ format, as I want to test AWQ with vLLM.
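For context, the end goal is to load the quantized checkpoint in vLLM, roughly like this (a sketch; the model path is a placeholder):

```python
from vllm import LLM, SamplingParams

# Placeholder path for the AWQ-quantized model
llm = LLM(model="your-username/llama31-awq", quantization="awq")
outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=20))
print(outputs[0].outputs[0].text)
```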
An important note: I used Unsloth for finetuning with LoRA and saved the model using the merge_and_unload() method of PeftModel.