AutoAWQ
awq quantization is not fully optimized yet. The speed can be slower than non-quantized models
When I ran the quantize code for llama3-70b-instruct, it was successful, but when I used vLLM to load the quantized model, I got a warning: awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
Does that affect the processing speed of this model?
This is my code:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'meta-llama/Meta-Llama-3-70B-Instruct'
quant_path = 'Meta-Llama-3-70B-Instruct-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Versions: vllm==0.4.3, vllm-flash-attn==2.5.8.post2, nccl==2.20.5
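For anyone wondering what the fields in quant_config control, here is a toy, stdlib-only sketch of asymmetric 4-bit quantization with a zero point. This is not AutoAWQ's actual implementation (AWQ also rescales weights by activation-aware factors before quantizing); it only illustrates the roles of "w_bit": 4 and "zero_point": True, with "q_group_size": 128 meaning each run of 128 weights along a row gets its own scale and zero point computed like this:

```python
def quantize_group(weights, w_bit=4, zero_point=True):
    """Toy asymmetric quantization of one weight group (illustration only).

    Returns the integer codes, the scale, the zero point, and the
    dequantized approximation of the input weights.
    """
    qmax = (1 << w_bit) - 1                     # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0       # guard all-equal groups
    # zero_point=True shifts the integer range so it covers [w_min, w_max]
    # asymmetrically; zero_point=False would quantize symmetrically around 0.
    zp = round(-w_min / scale) if zero_point else 0
    q = [max(0, min(qmax, round(w / scale) + zp)) for w in weights]
    deq = [(v - zp) * scale for v in q]         # what the kernel reconstructs
    return q, scale, zp, deq

# One tiny "group" of weights (a real AWQ group holds q_group_size=128):
group = [-0.5, 0.0, 0.25, 1.0]
q, scale, zp, deq = quantize_group(group)
```

Every code lands in [0, 15], and dequantization reproduces each weight to within half a scale step, which is the per-group rounding error the 4-bit format trades for a roughly 4x smaller weight footprint.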
Hi @jackNhat, AWQ models are underoptimized in vLLM. The good news is that the main branch has a new optimization that enables up to 2.59x higher performance - this should be released in vllm==0.5.3 in the coming days.
Many thanks, I am very much looking forward to it.