Quantizing DeepSeek-R1-Distill-Qwen-7B produces garbage and repetitive tokens
I'm trying to reproduce the AWQ quantization results by @casper-hansen for DeepSeek-R1-Distill-Qwen-7B, published here: casperhansen/deepseek-r1-distill-qwen-7b-awq
That model already works very well, but I want to calibrate on my own data. As a first step, I followed the standard example:
from datasets import load_dataset
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B'
quant_path = './quantized/deepseek-r1-distill-qwen-7b-awq-new'
# Quantization config
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
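For context, the custom-data run I eventually want to do would, as far as I understand the AutoAWQ README, only change the quantize() call by passing my own samples via calib_data. Rough sketch (my_traces.jsonl is just a placeholder file of mine; the max_calib_* keyword names are the ones I found in the README, so double-check them against your AutoAWQ version):
from datasets import load_dataset
# Placeholder file with my own samples, one JSON object per line with a "text" field
traces = load_dataset('json', data_files='my_traces.jsonl', split='train')['text']
# Same call as above, but with an explicit list of calibration strings
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=traces,
    max_calib_samples=128,
    max_calib_seq_len=2048,
)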
I believe the default recipe above is the same way casperhansen/deepseek-r1-distill-qwen-7b-awq was produced, according to #701.
However, the model I get this way produces garbage and repetitive output, while the one quantized by @casper-hansen works well and gives correct answers on AIME'24.
Result of pip freeze:
accelerate==1.4.0
aiohappyeyeballs==2.5.0
aiohttp==3.11.13
aiosignal==1.3.2
attrs==25.1.0
autoawq==0.2.8
autoawq_kernels==0.0.9
certifi==2025.1.31
charset-normalizer==3.4.1
contourpy==1.3.1
cycler==0.12.1
datasets==3.3.2
dill==0.3.8
einops==0.8.1
filelock==3.17.0
flash_attn==2.7.4.post1
fonttools==4.56.0
frozenlist==1.5.0
fsspec==2024.12.0
huggingface-hub==0.29.3
idna==3.10
Jinja2==3.1.6
jsonlines==4.0.0
kiwisolver==1.4.8
MarkupSafe==3.0.2
matplotlib==3.10.1
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.4.2
numpy==2.2.3
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
packaging==24.2
pandas==2.2.3
pillow==11.1.0
propcache==0.3.0
psutil==7.0.0
pyarrow==19.0.1
pyparsing==3.2.1
python-dateutil==2.9.0.post0
pytz==2025.1
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
safetensors==0.5.3
six==1.17.0
sympy==1.13.1
tokenizers==0.21.0
torch==2.6.0
tqdm==4.67.1
transformers==4.47.1
triton==3.2.0
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
xxhash==3.5.0
yarl==1.18.3
zstandard==0.23.0
Hardware: 2x RTX 3090
I'm having the same problem. It would be amazing to have the code that generated that quantization.
@FeSens Hi, were you able to fix this?
@casper-hansen it would be awesome if you could share which AutoAWQ version and settings you used when quantizing the DeepSeek-Distill models!
@hav4ik @FeSens Did you try supplying calibration data?
@tvmsandy33 yes, I tried supplying reasoning traces produced by the BF16 model as calibration data. I kept the prompt template, system prompt, etc. the same as at evaluation time (where the quantized model output gibberish).
I also tried the default calibration data (the Pile validation split) mentioned in #701, i.e. essentially the default calibration code.
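For reference, my calibration preparation looked roughly like this (the trace file and field names are placeholders; the point is just that each trace is rendered through the same chat template as at eval time before being handed to quantize(), with model, tokenizer and quant_config as in the script above):
import json
# Placeholder: reasoning traces generated with the BF16 model, one {"prompt": ..., "response": ...} per line
samples = [json.loads(line) for line in open('bf16_traces.jsonl')]
calib_texts = []
for s in samples:
    # Render with the same chat template / prompt setup used during evaluation
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": s["prompt"]},
         {"role": "assistant", "content": s["response"]}],
        tokenize=False,
    )
    calib_texts.append(text)
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)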