Wenhua Cheng
Signed-off-by: wenhuach21
Several models, such as LaMini-GPT, use this layer, but unfortunately most of our algorithms do not currently support it. W8A8: SQ; weight-only: RTN, TEQ. Better support for transformers.Conv1D and torch.conv1d...
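For context, transformers' Conv1D (used by GPT-2-style models such as LaMini-GPT) stores its weight transposed relative to nn.Linear, which is why Linear-only quantization paths miss it. Below is a minimal sketch of converting such a layer to an equivalent nn.Linear before weight-only quantization; the helper name is illustrative, not an existing API in our code:

```python
import torch
from torch import nn
from transformers.pytorch_utils import Conv1D


def conv1d_to_linear(conv: Conv1D) -> nn.Linear:
    """Convert a transformers Conv1D to an equivalent nn.Linear.

    Conv1D computes y = x @ W + b with W of shape (in_features, out_features),
    i.e. transposed relative to nn.Linear's (out_features, in_features).
    """
    in_features, out_features = conv.weight.shape
    linear = nn.Linear(in_features, out_features, bias=conv.bias is not None)
    linear.weight.data = conv.weight.data.t().contiguous()
    if conv.bias is not None:
        linear.bias.data = conv.bias.data.clone()
    return linear


# sanity check: both modules produce the same output
conv = Conv1D(nf=8, nx=4)  # nf = out_features, nx = in_features
linear = conv1d_to_linear(conv)
x = torch.randn(2, 4)
assert torch.allclose(conv(x), linear(x), atol=1e-6)
```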
For LLaMA, 2 patterns have not been detected: mlp.down_proj->mlp.up_proj and self_attn.o_proj->module.self_attn.v_proj. For OPT: self_attn.out_proj->self_attn.v_proj.
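While automatic detection misses these pairs, one possible stopgap is a hand-written per-architecture table resolved against the model's module names. The mapping and helper below are only an illustrative sketch, not the detector's actual implementation:

```python
# Illustrative fallback table of undetected layer pairs (suffixes of module
# names, per architecture), mirroring the pairs listed above.
FALLBACK_PAIRS = {
    "llama": [("mlp.down_proj", "mlp.up_proj"),
              ("self_attn.o_proj", "self_attn.v_proj")],
    "opt": [("self_attn.out_proj", "self_attn.v_proj")],
}


def resolve_pairs(model, model_type: str):
    """Map the suffix pairs to fully qualified module names found in `model`."""
    names = [n for n, _ in model.named_modules()]
    resolved = []
    for suffix_a, suffix_b in FALLBACK_PAIRS.get(model_type, []):
        names_a = [n for n in names if n.endswith(suffix_a)]
        names_b = [n for n in names if n.endswith(suffix_b)]
        # pair layers that live in the same decoder block (same name prefix)
        for a in names_a:
            prefix = a[: -len(suffix_a)]
            for b in names_b:
                if b.startswith(prefix):
                    resolved.append((a, b))
    return resolved
```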
If we compare our asymmetric quantization logic with AWQ's, there are some differences; a major distinction is whether the min-max range should include zero. In AWQ, zero...
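To make the distinction concrete, here is a minimal sketch of min-max asymmetric quantization in the two variants; clamping the observed range to contain zero changes the resulting scale and zero-point. The function is illustrative, not AWQ's or our exact code:

```python
import torch


def asym_qparams(w: torch.Tensor, bits: int = 4, include_zero: bool = True):
    """Min-max asymmetric quantization parameters for one tensor/group.

    include_zero=True first clamps the observed range so that it contains 0
    (min <= 0 <= max); include_zero=False uses the raw min/max as-is.
    """
    wmin, wmax = w.min(), w.max()
    if include_zero:
        wmin = torch.clamp(wmin, max=0.0)
        wmax = torch.clamp(wmax, min=0.0)
    qmax = 2**bits - 1
    scale = torch.clamp((wmax - wmin) / qmax, min=1e-9)
    zero_point = torch.round(-wmin / scale)
    return scale.item(), zero_point.item()


w = torch.tensor([0.3, 0.7, 1.1])  # all-positive group: raw min > 0
print(asym_qparams(w, include_zero=True))   # range widened to [0.0, 1.1]
print(asym_qparams(w, include_zero=False))  # range stays [0.3, 1.1]
```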
Smoke test done:
- llama3 with lm-head
- baichuan13b with lm-head
- chatglm3 (lm-head name transformer.output_layer)
- opt tied lm-head
- gemma-7b
- phi-2 lm-head
- mixtral
- Qwen1.5-7B-Chat lm-head
- Baichuan2-7B-Chat lm-head
- gpt-j-6b lm-head
- LaMini-GPT-124M conv1d tied weight...
There is no need to use an FP32 scale for packing with the AutoGPTQ Triton backend; we can set FP16 as the default scale dtype instead. Nonetheless, it's essential to validate accuracy...
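As a quick way to gauge the impact before a full accuracy run, one could compare dequantized weights using FP32 vs FP16 scales; this is an illustrative check only, not the packing code's actual API:

```python
import torch

# Compare dequantization error when group scales are kept in FP32
# vs cast to FP16 before packing (random data, illustrative only).
scale_fp32 = torch.rand(128, 1, dtype=torch.float32) * 0.01
qweight = torch.randint(0, 16, (128, 256))

deq_fp32 = (qweight - 8) * scale_fp32
deq_fp16 = (qweight - 8) * scale_fp32.to(torch.float16).float()
print("max abs diff:", (deq_fp32 - deq_fp16).abs().max().item())
```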
https://huggingface.co/databricks/dbrx-instruct/blob/main/modeling_dbrx.py A simple but engineering-wise ugly solution is to follow https://huggingface.co/databricks/dbrx-instruct/discussions/10 and change the matmul to linear; let's follow this approach to add a patch for this model.
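The idea, roughly, is to replace raw torch.matmul calls on weight Parameters with nn.Linear modules so that quantization code (which hooks nn.Linear) can see them. A generic sketch of that transformation, not the actual DBRX patch:

```python
import torch
from torch import nn


class MatmulBlock(nn.Module):
    """Stand-in for a module that keeps its weight as a raw Parameter."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.w = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        return torch.matmul(x, self.w.t())


def to_linear(block: MatmulBlock) -> nn.Linear:
    """Wrap the raw weight into nn.Linear so quantizers can hook the layer."""
    out_features, in_features = block.w.shape
    linear = nn.Linear(in_features, out_features, bias=False)
    linear.weight = block.w  # reuse the same Parameter, no copy
    return linear


block = MatmulBlock(16, 32)
linear = to_linear(block)
x = torch.randn(4, 16)
assert torch.allclose(block(x), linear(x), atol=1e-6)
```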
Hook transformers' AutoHfQuantizer to support different backends and mixed-precision quantization.
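A rough sketch of what such a hook could look like: wrap AutoHfQuantizer.from_config so that configs carrying a custom backend field are routed to our own quantizer. Only AutoHfQuantizer.from_config itself is the real transformers entry point; the `backend` field and `MyQuantizer` below are hypothetical placeholders:

```python
from transformers.quantizers import AutoHfQuantizer


class MyQuantizer:
    """Placeholder: a real implementation should subclass transformers.quantizers.HfQuantizer."""

    def __init__(self, quantization_config, **kwargs):
        self.quantization_config = quantization_config


_orig_from_config = AutoHfQuantizer.from_config


def patched_from_config(quantization_config, **kwargs):
    """Route configs that request a custom backend to our own quantizer."""
    backend = getattr(quantization_config, "backend", None)  # hypothetical field
    if backend in ("gptq", "awq", "itrex"):
        return MyQuantizer(quantization_config, **kwargs)
    return _orig_from_config(quantization_config, **kwargs)


AutoHfQuantizer.from_config = staticmethod(patched_from_config)
```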
Feature request:
1. Support different kernels in different backends, including gptq/awq/itrex.
2. Support different bits and group_size for different layers, as sketched below.
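For the second item, a plain per-layer configuration could look like the following; the dict layout and key names are illustrative, not a finalized API:

```python
# Per-layer overrides: keys are module names, values override the global
# quantization settings for that layer.
global_config = {"bits": 4, "group_size": 128, "sym": False}

layer_config = {
    "lm_head": {"bits": 8, "group_size": 32},        # keep the head more precise
    "model.layers.0.self_attn.v_proj": {"bits": 8},  # sensitive first block
}


def config_for(layer_name: str) -> dict:
    """Merge the global config with any per-layer override."""
    merged = dict(global_config)
    merged.update(layer_config.get(layer_name, {}))
    return merged


print(config_for("lm_head"))
print(config_for("model.layers.3.mlp.down_proj"))
```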
Waiting for the fix: https://github.com/AutoGPTQ/AutoGPTQ/pull/640
For calibration with lm-head quantization or during tuning.