Quantizing DeepSeek-R1-Distill-Qwen-7B produces garbage and repetitive tokens
I'm trying to reproduce the AWQ quantization results by @casper-hansen for DeepSeek-R1-Distill-Qwen-7B, published here: casperhansen/deepseek-r1-distill-qwen-7b-awq
That model already works very well, but I want to calibrate on my own data. As a first step, I followed the standard example:
from datasets import load_dataset
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B'
quant_path = './quantized/deepseek-r1-distill-qwen-7b-awq-new'
# Quantization config
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
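For context, the custom-data run I eventually want to do would, as far as I understand the AutoAWQ README, only change the quantize() call by passing my own samples via calib_data. Rough sketch (my_traces.jsonl is just a placeholder file of mine; the max_calib_* keyword names are the ones I found in the README, so double-check them against your AutoAWQ version):
from datasets import load_dataset
# Placeholder file with my own samples, one JSON object per line with a "text" field
traces = load_dataset('json', data_files='my_traces.jsonl', split='train')['text']
# Same call as above, but with an explicit list of calibration strings
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=traces,
    max_calib_samples=128,
    max_calib_seq_len=2048,
)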
I believe the default recipe above is the same way casperhansen/deepseek-r1-distill-qwen-7b-awq was produced, according to #701.
However, the model I get this way produces garbage and repetitive output, while the one quantized by @casper-hansen works well and gives correct answers on AIME'24.
Result of pip freeze:
accelerate==1.4.0
aiohappyeyeballs==2.5.0
aiohttp==3.11.13
aiosignal==1.3.2
attrs==25.1.0
autoawq==0.2.8
autoawq_kernels==0.0.9
certifi==2025.1.31
charset-normalizer==3.4.1
contourpy==1.3.1
cycler==0.12.1
datasets==3.3.2
dill==0.3.8
einops==0.8.1
filelock==3.17.0
flash_attn==2.7.4.post1
fonttools==4.56.0
frozenlist==1.5.0
fsspec==2024.12.0
huggingface-hub==0.29.3
idna==3.10
Jinja2==3.1.6
jsonlines==4.0.0
kiwisolver==1.4.8
MarkupSafe==3.0.2
matplotlib==3.10.1
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.4.2
numpy==2.2.3
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
packaging==24.2
pandas==2.2.3
pillow==11.1.0
propcache==0.3.0
psutil==7.0.0
pyarrow==19.0.1
pyparsing==3.2.1
python-dateutil==2.9.0.post0
pytz==2025.1
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
safetensors==0.5.3
six==1.17.0
sympy==1.13.1
tokenizers==0.21.0
torch==2.6.0
tqdm==4.67.1
transformers==4.47.1
triton==3.2.0
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
xxhash==3.5.0
yarl==1.18.3
zstandard==0.23.0
Hardware: 2x RTX 3090
I'm having the same problem. It would be amazing to have the code that generated that quantization.
@FeSens Hi, were you able to fix this?
@casper-hansen it would be awesome if you could share which AutoAWQ version and settings you used when quantizing the DeepSeek-Distill models!
@hav4ik @FeSens Did you try supplying calibration data?
@tvmsandy33 yes, I tried supplying reasoning traces produced by the BF16 model as calibration data. I kept the prompt template, system prompt, etc. the same as at evaluation time (where the quantized model output gibberish).
I also tried the default calibration data (the Pile validation split) mentioned in #701, i.e. essentially the default calibration code.
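For reference, my calibration preparation looked roughly like this (the trace file and field names are placeholders; the point is just that each trace is rendered through the same chat template as at eval time before being handed to quantize(), with model, tokenizer and quant_config as in the script above):
import json
# Placeholder: reasoning traces generated with the BF16 model, one {"prompt": ..., "response": ...} per line
samples = [json.loads(line) for line in open('bf16_traces.jsonl')]
calib_texts = []
for s in samples:
    # Render with the same chat template / prompt setup used during evaluation
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": s["prompt"]},
         {"role": "assistant", "content": s["response"]}],
        tokenize=False,
    )
    calib_texts.append(text)
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)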