auto-round

SOTA weight-only quantization algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs".

Results: 10 auto-round issues

Smoke test done: llama3 with lm-head, baichuan13b with lm-head, chatglm3 (lm-head name transformer.output_layer), opt tied lm-head, gemma-7b, phi-2 lm-head, mixtral, Qwen1.5-7B-Chat lm-head, Baichuan2-7B-Chat lm-head, gpt-j-6b lm-head, LaMini-GPT-124M conv1d tied weight...

While testing OPT with `quant_lm_head=True`, these are the resulting weight keys after quantization: `weight keys: ['lm_head.g_idx', 'lm_head.qweight', 'lm_head.qzeros', 'lm_head.scales', 'model.decoder.embed_positions.weight', 'model.decoder.embed_tokens.weight', ...` `model.decoder.embed_tokens.weight` is not quantized but `lm_head` is. Unfortunately...
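A minimal sketch of how one could list which modules ended up packed versus left in floating point, by checking for the `qweight`/`qzeros`/`scales`/`g_idx` suffixes shown above; the use of safetensors and the file name are assumptions, not something the issue specifies:

```python
from safetensors.torch import load_file

# Hypothetical checkpoint path; adjust to the actual quantized output file.
state_dict = load_file("model.safetensors")

# Modules packed by the GPTQ-style packer expose a ".qweight" key.
quantized = {k.rsplit(".", 1)[0] for k in state_dict if k.endswith(".qweight")}
# Modules left in floating point still carry a plain ".weight" key.
fp_modules = {k.rsplit(".", 1)[0] for k in state_dict
              if k.endswith(".weight") and k.rsplit(".", 1)[0] not in quantized}

print("quantized modules:", sorted(quantized))
print("fp modules:", sorted(fp_modules))
```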

There's no need to use an FP32 scale for packing with the autogptq Triton backend; we can make FP16 the default scale dtype instead. Nonetheless, it's essential to validate accuracy...
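As a quick sanity check for that change, a small sketch (hypothetical helper, not part of the codebase) that measures how much precision is lost when casting per-group scales from FP32 to FP16 before packing:

```python
import torch

def scale_cast_error(scales_fp32: torch.Tensor) -> float:
    """Max absolute round-trip error of casting scales to fp16 and back."""
    scales_fp16 = scales_fp32.to(torch.float16)
    return (scales_fp32 - scales_fp16.to(torch.float32)).abs().max().item()

# Dummy per-group scales just for illustration.
scales = torch.rand(128, 32) * 0.1
print(f"max |fp32 - fp16| scale error: {scale_cast_error(scales):.3e}")
```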

enhancement

Reasons for this PR: 1. fix compatibility with the latest autogptq; 2. store an autoround fingerprint/version using the `meta_set_quantizer(name, version)` API; 3. store autoround-specific parameters, unrelated to actual autogptq inference/quantization, in the meta region...
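For illustration only, the shape such a meta region might take; the key names below are assumptions rather than the actual autogptq schema, and only `meta_set_quantizer(name, version)` comes from the PR description:

```python
# Sketch of a quantize_config with a meta region (key names are assumptions).
quantize_config = {
    "bits": 4,
    "group_size": 128,
    "meta": {
        # autoround-specific knobs that autogptq inference can safely ignore
        "iters": 1000,
        "enable_minmax_tuning": True,
    },
}

def meta_set_quantizer(cfg: dict, name: str, version: str) -> None:
    """Hypothetical helper: record which tool produced the checkpoint."""
    cfg.setdefault("meta", {})["quantizer"] = f"{name}:{version}"

meta_set_quantizer(quantize_config, "autoround", "0.2.0")
```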

https://huggingface.co/databricks/dbrx-instruct/blob/main/modeling_dbrx.py A simple but inelegant engineering solution is to follow https://huggingface.co/databricks/dbrx-instruct/discussions/10 and change the matmul to a linear layer; let's follow this approach and add a patch for this model.
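A minimal sketch of what such a patch could look like, assuming the fix amounts to wrapping the bare weight tensor used by the matmul in an `nn.Linear` so per-module quantizers can see and replace it; attribute names and shapes here are assumptions, not the actual modeling_dbrx code:

```python
import torch
import torch.nn as nn

def matmul_weight_to_linear(weight: torch.Tensor) -> nn.Linear:
    """Wrap a raw (out_features, in_features) weight in an nn.Linear module."""
    out_features, in_features = weight.shape
    linear = nn.Linear(in_features, out_features, bias=False,
                       dtype=weight.dtype, device=weight.device)
    with torch.no_grad():
        linear.weight.copy_(weight)
    return linear

# Usage idea: register the wrapper in __init__ and have forward() call
# linear(x) instead of x @ weight.t(), so the layer becomes quantizable.
```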

enhancement

I'm now trying to quantize llama2-7b under the w4a16g128 setting. The script is `python3 main.py --model_name /mnt/bn/wyh-train/4bit/models/llama2-7b/model --device 0 --group_size 128 --bits 4 --iters 1000...`
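For reference, a rough Python-API equivalent of the command above (a sketch; the argument names are assumed to mirror `main.py`'s flags and may differ across auto-round versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_path = "/mnt/bn/wyh-train/4bit/models/llama2-7b/model"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# w4a16g128: 4-bit weights, 16-bit activations, group size 128.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, iters=1000)
autoround.quantize()
autoround.save_quantized("./llama2-7b-w4a16g128")
```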

Feature request: 1. support different kernels in different backends, including gptq/awq/itrex; 2. support different bits and group_size for different layers.
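A hypothetical shape for the per-layer half of this request; the dict layout and any `layer_config` argument name are assumptions, not the current auto-round API:

```python
# Layers listed here would override the global bits/group_size; everything
# else falls back to the defaults passed to the quantizer.
layer_config = {
    "lm_head": {"bits": 8, "group_size": 32},
    "model.layers.0.mlp.down_proj": {"bits": 8, "group_size": 128},
}
```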

enhancement

Waiting for the fix: https://github.com/AutoGPTQ/AutoGPTQ/pull/640