[QA] Where to download the DeepSeek-R1 GPTQ model?
How can I download a DeepSeek-R1 GPTQ-quantized model?
You can visit https://huggingface.co/models?search=gptq to download our DeepSeek R1 distilled 7B model, but we currently do not provide the full R1 model. You can use our toolkit to quantize your own R1 model.
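For reference, a minimal quantization sketch in the style of the GPTQModel README (the calibration dataset, batch size, and output path here are illustrative assumptions; check the repo README for the exact current API):

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Illustrative ids/paths only.
model_id = "unsloth/DeepSeek-R1-BF16"   # BF16 R1 weights (linked later in this thread)
quant_path = "DeepSeek-R1-gptq-4bit"    # hypothetical output directory

# Small calibration set for illustration; a full R1 quant will want more data.
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(256))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)     # load the BF16 model
model.quantize(calibration_dataset, batch_size=1)  # run GPTQ calibration
model.save(quant_path)                             # write the quantized checkpoint
```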
DeepSeek AI has released the FP8 version. Can your tools work with it directly? Have you considered releasing a DeepSeek R1 GPTQ-quantized version? It should be very popular.
You can use the BF16 version of R1 for GPTQ quantization. We do not have large H100+ GPUs to test FP8 model loading, and a 4090 has too little VRAM.
https://huggingface.co/unsloth/DeepSeek-R1-BF16/tree/main
Great, Thanks!
One more question: have you tested whether there are any issues with DeepSeek R1 GPTQ inference? Can it be used for inference with the `vllm serve --quantization gptq` method?
There are no technical reasons why a GPTQ-quantized R1 cannot run on vLLM or SGLang.
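A rough sketch of GPTQ inference through vLLM's offline Python API, assuming a hypothetical local path to your own quantized checkpoint (`vllm serve <path> --quantization gptq` is the server-side equivalent):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="DeepSeek-R1-gptq-4bit",  # hypothetical path to your GPTQ quant
    quantization="gptq",            # select the GPTQ weight loader/kernels
    tensor_parallel_size=8,         # adjust to your GPU count
    trust_remote_code=True,
)

sampling = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain GPTQ quantization in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```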
@Qubitium @Rane2021
Hello, I am quite interested in your work. I would like to ask you a few questions:
- Does this link provide the model compressed by your algorithm? https://huggingface.co/OPEA/DeepSeek-R1-int4-gptq-sym-inc
- I saw in the demo that it supports quantization down to 3-bit. Can it go lower than that?
- What is the difference between your work and https://github.com/IST-DASLab/gptq ? I would like to see the technical details of your paper.
Could you please tell me which DeepSeek 7B model you can compress? If convenient, please provide a link to the 7B model.
@hsb1995
- The link you referred to is a GPTQ quant model made by AutoRound. However, that model has not been benchmarked as far as I am aware, so I can't say one way or the other how good it is. AutoRound does not use the same algorithm, but it generates a model format that is compatible with GPTQ.
- Please check https://github.com/ModelCloud/GPTQModel#citation for links to the papers. We use the same original GPTQ algorithm pioneered by IST-DASLab.
- Please check our readme for a link to our quantized DeepSeek 7B model with full benchmarks: https://github.com/ModelCloud/GPTQModel#quality-gptq-4bit-50-bpw-can-match-bf16
@Qubitium
https://arxiv.org/abs/2210.17323 Hello professor, is this your project's paper?
That paper was written by the original GPTQ researchers. GPTQModel is code based on the original research team's code, plus many modifications to usage, inference, and quantization.
Hello, to quantize the DeepSeek R1 BF16 model to W8A8 using GPTQModel, is there a recommended minimum machine specification?
Yes. Get as big a single GPU as you possibly can. For RAM, you need 2TB.
Do you mean GPU RAM or CPU RAM? By the way, whether CPU or GPU, 2TB of RAM in a single machine is not easy for most people to come by 😂
DeepSeek R1 at BF16 is huge. The 2TB is CPU RAM, and you need a single GPU with 80GB+ VRAM for quantization.
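A back-of-the-envelope check of why that much host memory is needed, assuming R1's published ~671B total parameters:

```python
# Weights-only memory estimate for DeepSeek R1 in BF16.
params = 671e9            # total parameters (MoE total, not just active)
bytes_per_param = 2       # BF16 uses 2 bytes per weight
weights_tb = params * bytes_per_param / 1e12
print(f"~{weights_tb:.2f} TB for the weights alone")  # ~1.34 TB before runtime overhead
```

With framework overhead and calibration activations on top of roughly 1.34 TB of raw weights, ~2TB of system RAM is the practical floor.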
Hello, may I ask which models you plan to quantize in your collection at https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2, or is that collection just a proof of capability?
What makes me wonder is that there are multiple versions of QwQ-32B and DeepSeek R1-Distill-7B, but no DeepSeek R1-Distill-32B. Is there some problem with quantizing DeepSeek R1-Distill-32B to match the performance of the BF16 format?
@Qubitium I am trying to quantize on an H200 with a single GPU (140GB VRAM) and 1.8TB of CPU RAM.
However, I can only use 128 calibration samples; if I increase the number of calibration samples, the model seems to run out of GPU memory. Any solution to this?