Fully deprecate AutoGPTQ and AutoAWQ for GPT-QModel
Remove AutoGPTQ clutter and AutoGPTQ-related configs that are not worth keeping backward compat for.
GPTQModel has a slight project name change to GPT-QModel, with a dash (the PyPI package and import name stay the same), as we have now added AWQ/AutoAWQ into our repo and will be making a PR soon to address AWQ loading using GPT-QModel.
GPTQConfig has the most important changes in this PR:
# New GPTQConfig Property. Applicable for sister Peft/Optimum PRs
act_group_aware (`bool`, *optional*, defaults to `True`):
Use GAR (group-aware activation order) during quantization. Has a measurable positive impact on quantization quality. Only applicable when `desc_act = False`. Will be forced to `False` when `desc_act = True`.
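A minimal usage sketch of the new property (the model id and calibration dataset below are placeholders; only `act_group_aware` is new in this PR):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    group_size=128,
    desc_act=False,          # act_group_aware is only honored when desc_act=False
    act_group_aware=True,    # new property: group-aware activation ordering (GAR)
    dataset="c4",            # calibration data
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=gptq_config)
```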
# Removed GPTQConfig Properties:
- `use_cuda_fp16`
- `use_exllama`
- `exllama_config`

The 3 removed properties are all related to kernel selection. They are a hot-potato mess and legacy from AutoGPTQ. GPT-QModel uses the unified (existing) `backend` property to select kernels. Compat code was written in 2024 to convert these 3 properties to `backend` behind the scenes, but it is no longer relevant for 2025.
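A hedged sketch of what replaces them. The exact accepted `backend` string values are assumptions here, based on gpt-qmodel's BACKEND names (e.g. `"auto"`, `"marlin"`, `"exllama_v2"`, `"torch"`):

```python
from transformers import GPTQConfig

# Old (removed): use_cuda_fp16 / use_exllama / exllama_config picked the kernel.
# New: a single `backend` property; gpt-qmodel resolves it to the actual kernel.
config_auto = GPTQConfig(bits=4, group_size=128, backend="auto")      # let gpt-qmodel decide
config_pinned = GPTQConfig(bits=4, group_size=128, backend="marlin")  # pin a specific kernel
```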
Note:
- Transformers/Optimum/Peft CI tests should never check for `kernel.QUANT_TYPE` (a str). GPT-QModel will return the best-performing kernel for the relevant module, and it may differ per module due to in/out features and other GPTQ/module properties in relation to device type + dtype + many other factors.
- CI tests should only assert `kernel.QUANT_TYPE` if the test specifies a specific kernel via `backend` selection.
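A small illustration of that rule (the helper name here is an assumption for the sketch, not an actual test in the suite; `QUANT_TYPE` is the attribute referenced above):

```python
from typing import Optional

def assert_expected_kernel(qlinear, requested_backend: Optional[str]) -> None:
    """Only pin-check the kernel when the test explicitly selected a backend."""
    if requested_backend is None:
        # Auto-selected kernels may legitimately differ per module, device, and dtype,
        # so asserting a specific QUANT_TYPE here would make the test flaky by design.
        return
    assert qlinear.QUANT_TYPE == requested_backend, (
        f"expected kernel {requested_backend!r}, got {qlinear.QUANT_TYPE!r}"
    )
```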
cc @mekkcyber for quantization
We have begun AutoAWQ deprecation as well.
- Fused module code has all been removed. AutoAWQ used to do quant-linear-level fusing, but I do not believe this is maintainable or good: if SGLang/vLLM adopt Transformers v5 for model loading, they will do their own auto fusing and the quant module should not interfere with that.
- IPEX is deprecated by Intel and we have a new AwqTorchFused kernel (based on the same Intel TorchFused kernel used for GPTQ), so any code/unit tests for IPEX now point to the AwqTorchFused kernel.
Hi @Qubitium ! Thanks a lot for working on this! Quick question, what do you mean by AutoAWQ being part of GPT-QModel now? Did you integrate the entire library (including the transformers dependency, like AutoAWQ does), or did you just port over the linear layers, kernels, and related components?
> Hi @Qubitium ! Thanks a lot for working on this! Quick question, what do you mean by AutoAWQ being part of GPT-QModel now? Did you integrate the entire library (including the transformers dependency, like AutoAWQ does), or did you just port over the linear layers, kernels, and related components?
Long story short: we folded AutoAWQ into GPT-QModel in multiple stages over the past few weeks. Stage 1: directly port/copy the AutoAWQ code over. Stage 2: refactor. Stage 3: fix bugs, add new kernels, and do a major refactor to align with the new internal life cycle in GPT-QModel v5.0+. We are currently post Stage 3, where the GPT-QModel base retains minimal original AutoAWQ code; most of the AutoAWQ code has been refactored away.
Major Changes vs AutoAWQ:
- New kernels. We have 2 new kernels for AWQ: `AwqTorch` (pure torch based) and `AwqTorchFused` (CPU optimized, based on work by Intel @jiqing-feng).
- Plan to add a 3rd new AWQ kernel based on BitBLAS, as most GPTQ kernels are compatible with AWQ with some small changes. The Marlin kernel is also synced with the GPTQ Marlin kernel for updated Marlin fixes/optimizations via the vLLM port.
- QuantLinear code has been rewritten/refactored.
- Quant logic is new due to the GPT-QModel 5.0+ life cycle, which is not compatible with AutoAWQ.
HF ecosystem compat:
Work on the Peft integration is happening in a parallel PR by @LRL2-ModelCloud https://github.com/huggingface/peft/pull/2917 in coordination with @BenjaminBossan https://github.com/huggingface/peft/issues/2342#issuecomment-3547697772
The Peft PR will need to co-exist concurrently with this PR due to interdependency.
We will hold off on the Optimum change until last if we can help it, or may have to open a parallel 3rd PR to Optimum as well if inter-dependency causes trouble there too.
The final goal of the 2 PRs is to remove dead AutoGPTQ code (no one uses it, or frankly should use it) and the almost-dead AutoAWQ (repo is read-only and no longer accepting bug fixes or new model support). Compatibility for loading old models quantized with these two packages will be maintained.
Thanks for working on this @Qubitium . We are still debating if this is something we should offload to GPT-QModel or we should start upstreaming some of the inference code directly into transformers + kernels. Here is a proposal from a contributor: https://github.com/huggingface/transformers/pull/42256.
The goal would be to only upstream the GEMM path, but we can potentially leave the other kernels to GPT-QModel. WDYT?
About GPT-QModel, will it be possible to create AWQ quants for newer models that are compatible with other frameworks (e.g. vLLM), just like AutoAWQ did?
I think badly of this proposal.
> Thanks for working on this @Qubitium . We are still debating if this is something we should offload to GPT-QModel or we should start upstreaming some of the inference code directly into transformers + kernels. Here is a proposal from a contributor: #42256. The goal would be to only upstream the GEMM path, but we can potentially leave the other kernels to GPT-QModel. WDYT?
>
> About GPT-QModel, will it be possible to create AWQ quants for newer models that are compatible with other frameworks (e.g. vLLM), just like `autoawq` did?
I just checked the PR, which has no code. I am not going to waste time arguing vaporware vs. what I have done with AWQ in GPT-QModel over the past 2 months: AWQ inference and quantization, full-stack complete, with new kernels, new model support, and full CI kernel and modeling validation. GPT-QModel should be viewed not as an AutoAWQ port but as a full point-release upgrade in every regard.
Edit: I have outlined in a prior post why fusing is a bad idea. It is not AWQ's job to fuse in 2025. Leave it to model makers and higher-level engines such as SGLang and vLLM, which HF v5.0 is targeting from my understanding.
> About GPT-QModel, will it be possible to create AWQ quants for newer models that are compatible with other frameworks (e.g. vLLM), just like AutoAWQ did?
Our quantized models are more compatible with vLLM/SGLang than ones quantized with Optimum or AutoAWQ.
SGLang/vLLM compat has been a number-one target/design goal from day one, so 100% yes.
It's a good chance to deprecate AutoAWQ as it's archived. I suppose the best way is to go upstream into Transformers' main code, just like we did for the AutoGPTQ replacement. For example, the IPEX linear in AutoAWQ is out-of-date and we need a new implementation for it. The new linear implementation is TorchFusedLinear in gptqmodel.
> It's a good chance to deprecate AutoAWQ as it's archived. I suppose the best way is to go upstream into Transformers' main code, just like we did for the AutoGPTQ replacement. For example, the IPEX linear in AutoAWQ is out-of-date and we need a new implementation for it. The new linear implementation is TorchFusedLinear in gptqmodel.
@jiqing-feng The AWQ version of the GPTQ TorchFused kernel has been added to gpt-qmodel as AwqTorchFusedKernel. Same underlying code but with memory-layout tweaks to get it to work. AWQ kernel output tests are passing.
@SunMarc For the most part, the kernels for AWQ and GPTQ are shared. For example, we do not compile an extra AWQ-only Marlin kernel; the previously GPTQ-only Marlin kernel is synced from vLLM to run AWQ weights as well.
CI Passing status using GPT-QModel main branch:
transformers/tests/quantization/autoawq/test_awq.py:
test_awq.py::AwqTest::test_quantized_model PASSED
test_awq.py::AwqTest::test_quantized_model_bf16 PASSED
test_awq.py::AwqTest::test_quantized_model_conversion PASSED
test_awq.py::AwqTest::test_quantized_model_exllama FAILED <-- Needs fixing.
test_awq.py::AwqTest::test_quantized_model_multi_accelerator SKIPPED
test_awq.py::AwqTest::test_quantized_model_no_device_map PASSED
test_awq.py::AwqTest::test_save_pretrained PASSED
test_awq.py::AwqTest::test_raise_if_non_quantized PASSED
test_awq.py::AwqTest::test_quantized_model_no_k_proj_quantized PASSED
test_awq.py::AwqScaleTest::test_load_quantized_model PASSED
test_awq.py::AwqIPEXTest::test_quantized_model_ipex PASSED <-- test needs to be renamed to AwqTorchFused (ipex removed)
peft/tests/test_gpu_examples.py:
PeftAwqGPUTests PASSED
PeftGPTQGPUTests PASSED
@SunMarc PR is in a working state and ready for prelim review. Look at the code diffs: we are eliminating 5x more crud for every line of code we add for the new AWQ integration.
@SunMarc @MekkCyber Hold off on review. I will ping once ready. I need to remove more code related to fusing and kernel selection.
@SunMarc @MekkCyber Update: this PR will be updated once we finish a small refactor and add sync auto kernel selection, just like what we did with GPTQ in https://github.com/ModelCloud/GPTQModel/pull/2214. Both GPTQ and AWQ kernel selection will be folded into a single `hf_select_quant_linear_v2` interface for stability and a single entry point.
In addition, the original AwqGEMM kernel will be split into effectively 3 distinct kernels: TorchGEMM, CudaGEMM, TritonGEMM. The AutoAWQ GEMM kernel was actually 3 kernels in one monolithic kernel, which sounds nice but is terrible for CI/kernel output regression/comparison tests, with zero performance benefit. GPT-QModel will auto-select the kernel based on system env, device_map, and kernel qualifications (method, format, etc.). This will knock off another layer of complexity in the existing HF code.
# public/stable api exposed to transformer/optimum
def hf_select_quant_linear_v2(
bits: int,
group_size: int,
desc_act: bool,
sym: bool,
format: Union[str, FORMAT], # awq `version` should be pre-mapped to format
quant_method: Union[str, METHOD], # awq llm-awq `version` should be pre-mapped to method
zero_point: Optional[bool] = True, # awq only
dtype: Optional[Union[str, torch.dtype]] = None,
meta: Optional[Dict[str, Any]] = None,
pack: Optional[bool] = True,
device_map: Optional[Union[str, dict]] = None,
backend: Optional[Union[str, BACKEND]] = None,
) -> Type[BaseQuantLinear]:
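A hedged call sketch of the proposed API. The import path is an assumption (mirroring where the existing `hf_select_quant_linear` lives in gpt-qmodel), as are the `format`/`quant_method` string values; parameter names follow the signature above.

```python
import torch
from gptqmodel.utils.importer import hf_select_quant_linear_v2  # assumed location

qlinear_cls = hf_select_quant_linear_v2(
    bits=4,
    group_size=128,
    desc_act=False,
    sym=True,
    format="gemm",          # AWQ `version` pre-mapped to format by the caller
    quant_method="awq",     # AWQ/llm-awq pre-mapped to method by the caller
    zero_point=True,
    dtype=torch.float16,
    meta=None,
    pack=False,             # loading pre-quantized weights, not packing
    device_map="auto",
    backend=None,           # let gpt-qmodel auto-select the best kernel
)
print(qlinear_cls)          # resolved BaseQuantLinear subclass for this setup
```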
Thanks, left a couple of comments. As I said, I'm happy to see that you are willing to fill the hole left by AutoAWQ, and I'm eager to see this PR merged. However, note that maybe in the future we will add a default working path for GEMM if gptq-model is not installed. As those libraries depend on kernels that require dealing with building + distribution for each new version of torch, we never know when this will suddenly stop. Also, maybe it would be better to split this PR into 2: one for gptq and one for awq?
Does it mean we can upstream some specific ops for awq or gptq in the kernel-community? In that case, gptqmodel can pull kernels from the community at runtime?
@SunMarc @MekkCyber PR is now synced to the pending Peft/Optimum PRs. Ready for code review for this portion. All tests passing with the pending gpt-qmodel 5.4.4 release (later today).
Notable changes:
- `hf_select_quant_linear_v2` will now auto-select the kernel for both GPTQ and AWQ. No more kernel-selection crud in transformers: GPTQ and AWQ kernel selection is merged into a single api strictly used for HF, for future api stability. Let gpt-qmodel decide, as it has the best view to return the best/latest kernel.
- AutoAWQ `fusing` code has been removed. This code is not maintainable (static-map based, model-arch specific) and is not relevant for vLLM/SGLang as they do their own fusing. Transformers v5, I believe, is also introducing more generic fusing, so any manual, per-model-arch fusing done by the previous AutoAWQ code should be eliminated.
- AwqConfig now inherits from GPTQConfig due to shared properties. For GPTQ, legacy `checkpoint_format` is remapped to `format` internally, but for backward compat, until future deprecation, we also write to `checkpoint_format` on save via `to_dict`. For AWQ, `version` is now mapped to `format` internally, and likewise for compat, we write to `version` using the `format` value in `to_dict`. This is consistent with what gpt-qmodel does, for code clarity while maintaining backward compat. A small sketch of this compat mapping follows below.
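The sketch below illustrates the described AwqConfig behavior per this PR; the exact serialized key handling follows the description above and is not verified output.

```python
from transformers import AwqConfig, GPTQConfig

awq_config = AwqConfig(bits=4, group_size=128, zero_point=True, version="gemm")
assert isinstance(awq_config, GPTQConfig)  # new inheritance introduced by this PR

d = awq_config.to_dict()
# Internally `version` is tracked as `format`, but for backward compat the
# serialized dict should still carry `version` so old loaders keep working.
print(d.get("version"))  # expected: "gemm"
```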
@bot /style
Style bot fixed some files and pushed the changes.
@SunMarc Since last review:
- Unused AWQ properties (fuse related) removed.
- Fixed commented-out code related to IPEX and changed it to test the TorchFused kernel instead (IPEX replacement).
- Fixed HF AWQ kernel selection not passing `device_map` to `hf_select_quant_linear_v2`. Without `device_map`, it was selecting the wrong kernel, since gpt-qmodel needs device info to return the best kernel for the hardware.
The PR currently depends on GPT-QModel 5.4.4, which is not yet released, as we are working to resolve asap an internal regression related to the gptq packing code: https://github.com/ModelCloud/GPTQModel/issues/2234
LMK when we can merge !
> LMK when we can merge !
We are performing a final CI run in gpt-qmodel, and once that passes I will push the 5.6.0 release asap to PyPI so this PR can be ready for merge.
@SunMarc Ready. GPT-QModel v5.6.0 has been released with wheels currently building slowly for all the python/torch versions (may take a few hours): https://github.com/ModelCloud/GPTQModel/releases/tag/v5.6.0
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
[For maintainers] Suggested jobs to run (before merge)
run-slow: autoawq, gptq
View the CircleCI Test Summary for this PR:
https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=41567&sha=e16e48
Just to be sure, this will be part of transformers v5?
Thanks! About gptq-model installation, would it be possible at some point to fetch the correct wheels depending on the user setup (like torch does) when installing with pip? This is actually quite a barrier for new users. Or could you give more instructions?
Right now the pip install script already auto-downloads the precompiled gpt-qmodel wheel (125 MB) from GitHub releases, if the env's Python/torch/CUDA versions match. But yeah, it is painful to install if you have to compile from source, which takes 10-20 minutes.
> Just to be sure, this will be part of transformers v5?
Yes !
> Right now the pip install script already auto-downloads the precompiled gpt-qmodel wheel (125 MB) from GitHub releases, if the env's Python/torch/CUDA versions match. But yeah, it is painful to install if you have to compile from source, which takes 10-20 minutes.
Okay, I must have mismatched values, hence it wasn't downloading the pre-compiled wheels. Is there a way to return a warning if this is the case?
> Okay, I must have mismatched values, hence it wasn't downloading the pre-compiled wheels. Is there a way to return a warning if this is the case?
We missed building the torch 2.9.1 wheel for 5.6.0, so that is likely the cause. Will push 5.6.2 to resolve this plus other misc setup issues other users have reported, and add a log warning when setup cannot match a downloadable wheel.