[support] 2bit dequantize on xpu is slow

Open xiaohoua opened this issue 2 weeks ago • 4 comments

I use the following code to quantize the model:

autoround = AutoRound(
        model, 
        tokenizer, 
        dataset=calib_data, 
        bits=2,             
        group_size=128,     
        sym=False,           
        batch_size=batch_size, 
        seqlen=seqlen,
        n_samples=len(calib_data),
        iters=200,          
    )
autoround.quantize()
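(For completeness: after quantize() the model still has to be written to disk before infer.py can load it. A minimal sketch, assuming the standard auto_round save_quantized export API; the output path just mirrors the one used in infer.py below.)

# Write the quantized weights to disk so infer.py can load them
output_dir = r"D:\StreamingMedia\quantize\2bits\autoround\Qwen3-0.6B_2bit_parquet"
autoround.save_quantized(output_dir, format="auto_round")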

Here is my infer.py:

infer.py
import torch
import intel_extension_for_pytorch as ipex  # imported for Intel XPU support
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
import time

# 1. Set the path (make sure it points to the 2-bit model folder produced by the quantization step above)
model_path = r"D:\StreamingMedia\quantize\2bits\autoround\Qwen3-0.6B_2bit_parquet"

print(f"正在加载 2-Bit 模型: {model_path}")

# Load the model
# Models saved by AutoRound usually pick up device_map automatically
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    device_map="xpu", 
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Prepare the input
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# 2. Inference
print(f"Starting inference (device: {model.device}) ...")
start_time = time.time()

with torch.no_grad():
    # Generate up to 100 new tokens
    outputs = model.generate(**inputs, max_new_tokens=100)
print(model.model.layers[0].self_attn.q_proj)  # inspect which quantized linear implementation is in use
print(f"推理耗时: {time.time() - start_time:.4f} 秒")


result = tokenizer.decode(outputs[0], skip_special_tokens=True)  # outputs[0] is the full generated sequence; outputs[0][0] would decode a single token id

print("-" * 30)
print("生成结果 (2-Bit):")
print(result)
print("-" * 30)
Output: (screenshot of the generated text and timings, attached in the original issue)

With max_new_tokens=100, inference takes 43~50 s, i.e. only about 2 tokens/s. There is also a warning in the output:

2025-11-26 17:06:13 WARNING _logger.py L68: pip install "triton>=2.0" but triton on intel xpu with auto-round may have some issue
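(For reference, the tokens/s figure is just the number of newly generated tokens divided by wall-clock time; a minimal sketch reusing the variables from infer.py above, with elapsed captured right after generate:)

elapsed = time.time() - start_time                               # wall-clock time spent in generate()
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]   # tokens produced beyond the prompt
print(f"Throughput: {new_tokens / elapsed:.2f} tokens/s")        # roughly 2 tokens/s in this run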

So is there a plan for faster 2-bit dequantization on XPU, via Triton or some other method?

xiaohoua avatar Nov 26 '25 09:11 xiaohoua

Yes, your code is currently running on our fallback PyTorch backend, which is quite slow. We will provide our own kernel in the next release, and it will not depend on Triton. However, since you’re on Windows, there might be some delays. Maybe you could use gguf format for now.

wenhuach21 avatar Nov 26 '25 09:11 wenhuach21

Thanks for your quick reply. So you mean that 2-bit dequantization of a GGUF-format model is relatively faster on XPU right now? Can you share a demo or a link? I'm not sure how to use that.

xiaohoua avatar Nov 26 '25 09:11 xiaohoua

GGUF runs quite fast on CPU. You can refer to Ollama, and quantize your model to GGUF using AutoRound:


auto-round --model xxx --format "gguf:q4_k_m" --iters 0

Switch the format to gguf:q2_k_s if you want 2-bit quantization, although this is not recommended for smaller models, especially those under 7B.
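(For a quick local check, a minimal sketch of running the exported GGUF on CPU with llama-cpp-python, as an alternative to Ollama; the model filename below is hypothetical.)

from llama_cpp import Llama

# Load the exported GGUF file on CPU (filename is hypothetical)
llm = Llama(model_path="Qwen3-0.6B-q4_k_m.gguf", n_ctx=2048)
out = llm("There is a girl who likes adventure,", max_tokens=100)
print(out["choices"][0]["text"])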

wenhuach21 avatar Nov 26 '25 09:11 wenhuach21

OK, so GGUF is for CPU. Another question: will you open-source your kernel implementation in the next release?

xiaohoua avatar Nov 26 '25 10:11 xiaohoua

Thanks for your interest. We don't have a plan to release our reference kernel at the moment, as the main focus of AutoRound is to provide a leading quantization algorithm for LLMs.

hshen14 avatar Nov 26 '25 23:11 hshen14