[support] 2-bit dequantization on XPU is slow
I use the following to quantize the model:

autoround = AutoRound(
    model,
    tokenizer,
    dataset=calib_data,
    bits=2,
    group_size=128,
    sym=False,
    batch_size=batch_size,
    seqlen=seqlen,
    n_samples=len(calib_data),
    iters=200,
)
autoround.quantize()
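The 2-bit folder loaded in infer.py below is assumed to have been exported with AutoRound's save_quantized; a minimal sketch (the output directory name here is only an example, not the exact path used later):

# Export the quantized model so it can be reloaded via transformers.
# The directory name is illustrative.
autoround.save_quantized("Qwen3-0.6B_2bit", format="auto_round")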
Here is my infer.py:
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
import time

# 1. Set the path (make sure this points to the 2-bit model folder produced in the previous step)
model_path = r"D:\StreamingMedia\quantize\2bits\autoround\Qwen3-0.6B_2bit_parquet"
print(f"Loading 2-bit model: {model_path}")

# Load the model
# Models saved by AutoRound usually adapt to device_map automatically
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="xpu",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Prepare the input
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# 2. Inference
print(f"Starting inference (device: {model.device})...")
start_time = time.time()
with torch.no_grad():
    # Generate up to 100 new tokens
    outputs = model.generate(**inputs, max_new_tokens=100)
    print(model.model.layers[0].self_attn.q_proj)
print(f"Inference time: {time.time() - start_time:.4f} s")

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("-" * 30)
print("Generated output (2-bit):")
print(result)
print("-" * 30)
With max_new_tokens=100, inference takes 43-50 s, which is only about 2 tokens/s.
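(For reference, a minimal sketch of how that rate can be computed directly from the script above; it reuses the existing inputs, outputs, and start_time variables and assumes the time is taken right after generate returns:)

# Rough decode throughput, reusing tensors from the script above.
elapsed = time.time() - start_time  # taken right after generate returns
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} new tokens in {elapsed:.1f} s -> {new_tokens / elapsed:.2f} tokens/s")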
There is also a warning in the output: 2025-11-26 17:06:13 WARNING _logger.py L68: pip install "triton>=2.0"
However, Triton on Intel XPU with auto-round may have some issues, so do you have a plan for 2-bit dequantization on XPU, either with Triton or some other method?
Yes, your code is currently running on our fallback PyTorch backend, which is quite slow. We will provide our own kernel in the next release, and it will not depend on Triton. However, since you're on Windows, there might be some delays. Maybe you could use the GGUF format for now.
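(For reference, a quick, generic way to check which linear implementation was actually loaded; the class name you see depends on the auto-round and backend versions, so this is only a probe, not an official API:)

# Print the class of one quantized projection layer to see which
# backend implementation was loaded for it.
layer = model.model.layers[0].self_attn.q_proj
print(type(layer).__module__, type(layer).__name__)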
Thanks for your quick reply. So you mean that 2-bit dequantization with a GGUF-format model is relatively faster on XPU right now? Can you share a demo or a link? I'm not sure how to use that.
GGUF runs quite fast on CPU. You can refer to Ollama and quantize your model to GGUF using AutoRound:
auto-round --model xxx --format "gguf:q4_k_m" --iters 0
Switch the format to gguf:q2_k_s if you want 2-bit quantization, although this is not recommended for smaller models, especially those under 7B.
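If you would rather try the resulting GGUF file directly from Python instead of through Ollama, one common option (not part of AutoRound; the file name and parameters below are only illustrative) is llama-cpp-python:

from llama_cpp import Llama

# Hypothetical path to the GGUF file produced by the auto-round command above.
llm = Llama(model_path="Qwen3-0.6B-q2_k_s.gguf", n_ctx=2048)
out = llm("There is a girl who likes adventure,", max_tokens=100)
print(out["choices"][0]["text"])

Ollama works in a similar way: point a Modelfile's FROM line at the GGUF file, then load and run it with ollama create and ollama run.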
OK, so GGUF is for CPU. Another question: will you open-source your kernel implementation in the next release?
Thanks for your interest. We don't have a plan to release our reference kernel at this moment, as the main focus of AutoRound is to provide a leading quantization algorithm for LLMs.