auto-round

SOTA weight-only quantization algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs".

Results: 10 auto-round issues

Smoke test done: llama3 with lm-head, baichuan13b with lm-head, chatglm3 (lm-head name transformer.output_layer), opt tied lm-head, gemma-7b, phi-2 lm-head, mixtral, Qwen1.5-7B-Chat lm-head, Baichuan2-7B-Chat lm-head, gpt-j-6b lm-head, LaMini-GPT-124M conv1d tied weight...

While testing OPT with `quant_lm_head=True`, these are the resulting weight keys after quantization: `weight keys: ['lm_head.g_idx', 'lm_head.qweight', 'lm_head.qzeros', 'lm_head.scales', 'model.decoder.embed_positions.weight', 'model.decoder.embed_tokens.weight', ...` `model.decoder.embed_tokens.weight` is not quantized but `lm_head` is. Unfortunately...
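A minimal sketch of how one could list which modules ended up packed versus left in floating point, by checking for the `qweight`/`qzeros`/`scales`/`g_idx` suffixes shown above; the use of safetensors and the file name are assumptions, not something the issue specifies:

```python
from safetensors.torch import load_file

# Hypothetical checkpoint path; adjust to the actual quantized output file.
state_dict = load_file("model.safetensors")

# Modules packed by the GPTQ-style packer expose a ".qweight" key.
quantized = {k.rsplit(".", 1)[0] for k in state_dict if k.endswith(".qweight")}
# Modules left in floating point still carry a plain ".weight" key.
fp_modules = {k.rsplit(".", 1)[0] for k in state_dict
              if k.endswith(".weight") and k.rsplit(".", 1)[0] not in quantized}

print("quantized modules:", sorted(quantized))
print("fp modules:", sorted(fp_modules))
```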

There's no need to use an FP32 scale for packing with the autogptq Triton backend; we can make FP16 the default scale dtype instead. Nonetheless, it's essential to validate accuracy...
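As a quick sanity check for that change, a small sketch (hypothetical helper, not part of the codebase) that measures how much precision is lost when casting per-group scales from FP32 to FP16 before packing:

```python
import torch

def scale_cast_error(scales_fp32: torch.Tensor) -> float:
    """Max absolute round-trip error of casting scales to fp16 and back."""
    scales_fp16 = scales_fp32.to(torch.float16)
    return (scales_fp32 - scales_fp16.to(torch.float32)).abs().max().item()

# Dummy per-group scales just for illustration.
scales = torch.rand(128, 32) * 0.1
print(f"max |fp32 - fp16| scale error: {scale_cast_error(scales):.3e}")
```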

enhancement

Reasons for this PR: 1. fix compatibility with the latest autogptq; 2. store an autoround fingerprint/version using the `meta_set_quantizer(name, version)` API; 3. store autoround-specific parameters, unrelated to actual autogptq inference/quantization, in the meta region...
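For illustration only, the shape such a meta region might take; the key names below are assumptions rather than the actual autogptq schema, and only `meta_set_quantizer(name, version)` comes from the PR description:

```python
# Sketch of a quantize_config with a meta region (key names are assumptions).
quantize_config = {
    "bits": 4,
    "group_size": 128,
    "meta": {
        # autoround-specific knobs that autogptq inference can safely ignore
        "iters": 1000,
        "enable_minmax_tuning": True,
    },
}

def meta_set_quantizer(cfg: dict, name: str, version: str) -> None:
    """Hypothetical helper: record which tool produced the checkpoint."""
    cfg.setdefault("meta", {})["quantizer"] = f"{name}:{version}"

meta_set_quantizer(quantize_config, "autoround", "0.2.0")
```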

https://huggingface.co/databricks/dbrx-instruct/blob/main/modeling_dbrx.py A simple but inelegant engineering solution is to follow https://huggingface.co/databricks/dbrx-instruct/discussions/10 and change the matmul to a linear layer; let's follow this approach and add a patch for this model.
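A minimal sketch of what such a patch could look like, assuming the fix amounts to wrapping the bare weight tensor used by the matmul in an `nn.Linear` so per-module quantizers can see and replace it; attribute names and shapes here are assumptions, not the actual modeling_dbrx code:

```python
import torch
import torch.nn as nn

def matmul_weight_to_linear(weight: torch.Tensor) -> nn.Linear:
    """Wrap a raw (out_features, in_features) weight in an nn.Linear module."""
    out_features, in_features = weight.shape
    linear = nn.Linear(in_features, out_features, bias=False,
                       dtype=weight.dtype, device=weight.device)
    with torch.no_grad():
        linear.weight.copy_(weight)
    return linear

# Usage idea: register the wrapper in __init__ and have forward() call
# linear(x) instead of x @ weight.t(), so the layer becomes quantizable.
```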

enhancement

I'm now trying to quantize llama2-7b under the w4a16g128 setting. The script is `python3 main.py --model_name /mnt/bn/wyh-train/4bit/models/llama2-7b/model --device 0 --group_size 128 --bits 4 --iters 1000...`
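For reference, a rough Python-API equivalent of the command above (a sketch; the argument names are assumed to mirror `main.py`'s flags and may differ across auto-round versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_path = "/mnt/bn/wyh-train/4bit/models/llama2-7b/model"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# w4a16g128: 4-bit weights, 16-bit activations, group size 128.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, iters=1000)
autoround.quantize()
autoround.save_quantized("./llama2-7b-w4a16g128")
```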

Feature request: 1. support different kernels in different backends, including gptq/awq/itrex; 2. support different bits and group_size for different layers.
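A hypothetical shape for the per-layer half of this request; the dict layout and any `layer_config` argument name are assumptions, not the current auto-round API:

```python
# Layers listed here would override the global bits/group_size; everything
# else falls back to the defaults passed to the quantizer.
layer_config = {
    "lm_head": {"bits": 8, "group_size": 32},
    "model.layers.0.mlp.down_proj": {"bits": 8, "group_size": 128},
}
```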

enhancement

Waiting for the fix: https://github.com/AutoGPTQ/AutoGPTQ/pull/640