GPTQModel

[QA] Where to download DeepSeek-R1 gptq model?

Open Rane2021 opened this issue 11 months ago • 18 comments

How can I download the DeepSeek-R1 GPTQ quantized model?

Rane2021 avatar Feb 12 '25 06:02 Rane2021

You can visit https://huggingface.co/models?search=gptq to download our DeepSeek R1 distilled 7B model, but we currently do not provide the full R1 model. You can use our toolkit to quantize your own R1 model.
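
For reference, a minimal download-and-run sketch roughly following the GPTQModel README's inference example. The repo id below is a placeholder, not an endorsement of a specific quant; substitute whichever GPTQ checkpoint you pick from that search page.

```python
# Sketch: load a pre-quantized GPTQ checkpoint from the Hub and run it with GPTQModel.
# The repo id is a placeholder; pick an actual DeepSeek R1 distilled 7B GPTQ quant
# from https://huggingface.co/models?search=gptq
from gptqmodel import GPTQModel

quant_id = "ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptq-4bit"  # placeholder repo id

model = GPTQModel.load(quant_id)                    # downloads and loads the quantized weights
tokens = model.generate("The capital of France is")[0]
print(model.tokenizer.decode(tokens))
```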

Qubitium avatar Feb 12 '25 06:02 Qubitium

You can visit https://huggingface.co/models?search=gptq to download our DeepSeek R1 distilled 7B model, but we currently do not provide the full R1 model. You can use our toolkit to quantize your own R1 model.

Deepseek.ai has released the FP8 version. Can your toolkit work with it directly? Have you considered releasing a DeepSeek R1 GPTQ quantized version? It should be very popular.

Rane2021 avatar Feb 12 '25 06:02 Rane2021

You can use the BF16 version of R1 for GPTQ quantization. We do not have a large H100+ GPU to test FP8 model loading; the 4090 has too little VRAM.

https://huggingface.co/unsloth/DeepSeek-R1-BF16/tree/main
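
For anyone following along, here is a minimal quantization sketch along the lines of the GPTQModel README usage. The output path and calibration slice are placeholders, not a tested R1 recipe, and full R1 needs far more CPU RAM and a large single GPU than this snippet implies.

```python
# Sketch: GPTQ-quantize a BF16 checkpoint with GPTQModel (paths/calibration are placeholders).
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "unsloth/DeepSeek-R1-BF16"   # BF16 checkpoint linked above
quant_path = "DeepSeek-R1-gptq-4bit"    # example local output directory

# Small calibration set for illustration; real R1 quantization needs much more memory.
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(256))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=1)
model.save(quant_path)
```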

Qubitium avatar Feb 12 '25 08:02 Qubitium

Great, thanks!

Rane2021 avatar Feb 12 '25 08:02 Rane2021

One more question: have you tested whether there are any issues with DeepSeek R1 GPTQ inference? Can it be used for inference via the vllm serve --quantization gptq method?

Rane2021 avatar Feb 12 '25 09:02 Rane2021

One more question: have you tested whether there are any issues with DeepSeek R1 GPTQ inference? Can it be used for inference via the vllm serve --quantization gptq method?

There is no technical reason why a GPTQ-quantized R1 cannot run on vLLM or SGLang.
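
As a rough sketch of the offline equivalent of vllm serve --quantization gptq (the model path, tensor-parallel degree, and trust_remote_code flag here are assumptions for illustration; the same call works for the distilled 7B quants):

```python
# Sketch: offline inference of a GPTQ quant with vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/DeepSeek-R1-gptq-4bit",  # placeholder: your quantized output directory
    quantization="gptq",                    # mirrors `vllm serve ... --quantization gptq`
    tensor_parallel_size=8,                 # assumption: full R1 needs multiple large GPUs
    trust_remote_code=True,                 # precautionary for custom model code
)

outputs = llm.generate(
    ["Explain GPTQ in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```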

Qubitium avatar Feb 12 '25 09:02 Qubitium

@Qubitium @Rane2021

Hello, I am quite interested in your work. I would like to ask you a few questions:

  1. Does this link provide the model compressed by your algorithm? https://huggingface.co/OPEA/DeepSeek-R1-int4-gptq-sym-inc
  2. I saw in the demo that it supports 3-bit quantization. Can it go lower?
  3. What is the difference between your work and https://github.com/IST-DASLab/gptq? I would like to see the technical details in your paper.

hsb1995 avatar Feb 24 '25 01:02 hsb1995

You can visit https://huggingface.co/models?search=gptq to download our DeepSeek R1 distilled 7B model, but we currently do not provide the full R1 model. You can use our toolkit to quantize your own R1 model.

Could you please tell me which DeepSeek 7B model you can compress? If convenient, please provide a link to the 7B model.

hsb1995 avatar Feb 24 '25 01:02 hsb1995

@hsb1995

  1. The link you referred to is a GPTQ quant model made by AutoRound. However, that model has not been benchmarked, as far as I am aware, so I can't say one way or the other how good it is. AutoRound does not use the same algorithm, but it generates a model format that is compatible with GPTQ.
  2. Please check https://github.com/ModelCloud/GPTQModel#citation for links to the papers. We use the same original GPTQ algorithm pioneered by IST-DASLab.
  3. Please check our readme for a link to our quantized DeepSeek 7B model with full benchmarks: https://github.com/ModelCloud/GPTQModel#quality-gptq-4bit-50-bpw-can-match-bf16

Qubitium avatar Feb 24 '25 03:02 Qubitium

https://arxiv.org/abs/2210.17323 Hello professor, is this your project's paper?

hsb1995 avatar Feb 24 '25 07:02 hsb1995

https://arxiv.org/abs/2210.17323 Hello professor, is this your project's paper?

@Qubitium

hsb1995 avatar Feb 24 '25 07:02 hsb1995

https://arxiv.org/abs/2210.17323 Hello professor, is this your project's paper?

This paper was written by the original GPTQ researchers. GPTQModel is code based on the original research team's implementation, plus many modifications to usage, inference, and quantization.

Qubitium avatar Feb 24 '25 07:02 Qubitium

You can use the BF16 version of R1 for GPTQ quantization. We do not have a large H100+ GPU to test FP8 model loading; the 4090 has too little VRAM.

https://huggingface.co/unsloth/DeepSeek-R1-BF16/tree/main

Hello, to quantize the DeepSeek R1 BF16 model to W8A8 using GPTQModel, is there a recommended minimum machine specification?

liu316484231 avatar Mar 17 '25 02:03 liu316484231

You can use the BF16 version of R1 for GPTQ quantization. We do not have a large H100+ GPU to test FP8 model loading; the 4090 has too little VRAM.

https://huggingface.co/unsloth/DeepSeek-R1-BF16/tree/main

Hello, to quantize the DeepSeek R1 BF16 model to W8A8 using GPTQModel, is there a recommended minimum machine specification?

Yes. Get as big of a single GPU as you possibly can. For RAM, you need 2TB.

Qubitium avatar Mar 17 '25 02:03 Qubitium

You can use the BF16 version of R1 for GPTQ quantization. We do not have a large H100+ GPU to test FP8 model loading; the 4090 has too little VRAM. https://huggingface.co/unsloth/DeepSeek-R1-BF16/tree/main

Hello, to quantize the DeepSeek R1 BF16 model to W8A8 using GPTQModel, is there a recommended minimum machine specification?

Yes. Get as big of a single GPU as you possibly can. For RAM, you need 2TB.

Do you mean GPU RAM or CPU RAM? By the way, whether CPU or GPU, 2TB of RAM in a single machine is not easy for most people to get 😂

liu316484231 avatar Mar 17 '25 02:03 liu316484231

You can use the BF16 version of R1 for GPTQ quantization. We do not have a large H100+ GPU to test FP8 model loading; the 4090 has too little VRAM. https://huggingface.co/unsloth/DeepSeek-R1-BF16/tree/main

Hello, to quantize the DeepSeek R1 BF16 model to W8A8 using GPTQModel, is there a recommended minimum machine specification?

Yes. Get as big of a single GPU as you possibly can. For RAM, you need 2TB.

Do you mean GPU RAM or CPU RAM? By the way, whether CPU or GPU, 2TB of RAM in a single machine is not easy for most people to get 😂

DeepSeek R1 at BF16 is huge. The 2TB is CPU RAM, and you need a single GPU with 80GB+ VRAM for quantization.
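
For a rough sense of why, here is a back-of-the-envelope estimate assuming the published ~671B total parameter count for DeepSeek R1; it is an approximation, not a measured requirement.

```python
# Back-of-the-envelope memory estimate for DeepSeek R1 in BF16.
total_params = 671e9        # published total parameter count (assumption for this estimate)
bf16_bytes_per_param = 2    # 2 bytes per BF16 weight

weights_tb = total_params * bf16_bytes_per_param / 1e12
print(f"BF16 weights alone: ~{weights_tb:.2f} TB")  # ~1.34 TB

# With tokenizer/calibration buffers and framework overhead on top of ~1.34 TB of
# weights held in system memory, ~2 TB of CPU RAM is a realistic floor; the per-layer
# GPTQ solve is then done on a single large (80GB+) GPU.
```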

Qubitium avatar Mar 17 '25 02:03 Qubitium

@hsb1995

  1. The link you referred to is a GPTQ quant model made by AutoRound. However, that model has not been benchmarked, as far as I am aware, so I can't say one way or the other how good it is. AutoRound does not use the same algorithm, but it generates a model format that is compatible with GPTQ.
  2. Please check https://github.com/ModelCloud/GPTQModel#citation for links to the papers. We use the same original GPTQ algorithm pioneered by IST-DASLab.
  3. Please check our readme for a link to our quantized DeepSeek 7B model with full benchmarks: https://github.com/ModelCloud/GPTQModel#quality-gptq-4bit-50-bpw-can-match-bf16

Hello, may I ask which models you choose to quantize for your collection at https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2? Or is that collection just a proof of capability?

What makes me wonder is that there are multiple versions of QwQ-32B and DeepSeek R1-Distill-7B, but no DeepSeek R1-Distill-32B. Is there some problem with quantizing DeepSeek R1-Distill-32B to match the performance of the BF16 format?

bash99 avatar Apr 16 '25 06:04 bash99

Yes. Get as big of a single GPU as you possibly can. For RAM, you need 2TB.

@Qubitium I am trying to quantize on an H200, with 140GB of single-GPU RAM and 1.8TB of CPU RAM.

However, I can only use 128 calibration samples. If I increase the number of calibration samples, the model seems to run out of GPU memory. Any solution to this?

quantLm14 avatar May 27 '25 16:05 quantLm14