Wenhua Cheng
Signed-off-by: wenhuach21
Several models, such as LaMini-GPT, use this layer, but unfortunately most of our algorithms do not currently support it. W8A8: SQ; weight-only: RTN, TEQ. Better support for transformers.Conv1D and torch.conv1d...
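For context, transformers' Conv1D (used by GPT-2-style models such as LaMini-GPT) stores its weight transposed relative to nn.Linear, which is why Linear-only quantization paths miss it. Below is a minimal sketch of converting such a layer to an equivalent nn.Linear before weight-only quantization; the helper name is illustrative, not an existing API in our code:

```python
import torch
from torch import nn
from transformers.pytorch_utils import Conv1D


def conv1d_to_linear(conv: Conv1D) -> nn.Linear:
    """Convert a transformers Conv1D to an equivalent nn.Linear.

    Conv1D computes y = x @ W + b with W of shape (in_features, out_features),
    i.e. transposed relative to nn.Linear's (out_features, in_features).
    """
    in_features, out_features = conv.weight.shape
    linear = nn.Linear(in_features, out_features, bias=conv.bias is not None)
    linear.weight.data = conv.weight.data.t().contiguous()
    if conv.bias is not None:
        linear.bias.data = conv.bias.data.clone()
    return linear


# sanity check: both modules produce the same output
conv = Conv1D(nf=8, nx=4)  # nf = out_features, nx = in_features
linear = conv1d_to_linear(conv)
x = torch.randn(2, 4)
assert torch.allclose(conv(x), linear(x), atol=1e-6)
```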
For LLaMA, 2 patterns have not been detected: mlp.down_proj->mlp.up_proj and self_attn.o_proj->module.self_attn.v_proj. For OPT: self_attn.out_proj->self_attn.v_proj.
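While automatic detection misses these pairs, one possible stopgap is a hand-written per-architecture table resolved against the model's module names. The mapping and helper below are only an illustrative sketch, not the detector's actual implementation:

```python
# Illustrative fallback table of undetected layer pairs (suffixes of module
# names, per architecture), mirroring the pairs listed above.
FALLBACK_PAIRS = {
    "llama": [("mlp.down_proj", "mlp.up_proj"),
              ("self_attn.o_proj", "self_attn.v_proj")],
    "opt": [("self_attn.out_proj", "self_attn.v_proj")],
}


def resolve_pairs(model, model_type: str):
    """Map the suffix pairs to fully qualified module names found in `model`."""
    names = [n for n, _ in model.named_modules()]
    resolved = []
    for suffix_a, suffix_b in FALLBACK_PAIRS.get(model_type, []):
        names_a = [n for n in names if n.endswith(suffix_a)]
        names_b = [n for n in names if n.endswith(suffix_b)]
        # pair layers that live in the same decoder block (same name prefix)
        for a in names_a:
            prefix = a[: -len(suffix_a)]
            for b in names_b:
                if b.startswith(prefix):
                    resolved.append((a, b))
    return resolved
```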
If we compare our asymmetric quantization logic with AWQ's, there are some differences; a major distinction is whether the min-max range should include zero. In AWQ, zero...
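To make the distinction concrete, here is a minimal sketch of min-max asymmetric quantization in the two variants; clamping the observed range to contain zero changes the resulting scale and zero-point. The function is illustrative, not AWQ's or our exact code:

```python
import torch


def asym_qparams(w: torch.Tensor, bits: int = 4, include_zero: bool = True):
    """Min-max asymmetric quantization parameters for one tensor/group.

    include_zero=True first clamps the observed range so that it contains 0
    (min <= 0 <= max); include_zero=False uses the raw min/max as-is.
    """
    wmin, wmax = w.min(), w.max()
    if include_zero:
        wmin = torch.clamp(wmin, max=0.0)
        wmax = torch.clamp(wmax, min=0.0)
    qmax = 2**bits - 1
    scale = torch.clamp((wmax - wmin) / qmax, min=1e-9)
    zero_point = torch.round(-wmin / scale)
    return scale.item(), zero_point.item()


w = torch.tensor([0.3, 0.7, 1.1])  # all-positive group: raw min > 0
print(asym_qparams(w, include_zero=True))   # range widened to [0.0, 1.1]
print(asym_qparams(w, include_zero=False))  # range stays [0.3, 1.1]
```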
Smoke test done:
- llama3 with lm-head
- baichuan13b with lm-head
- chatglm3 (lm-head name transformer.output_layer)
- opt tied lm-head
- gemma-7b
- phi-2 lm-head
- mixtral
- Qwen1.5-7B-Chat lm-head
- Baichuan2-7B-Chat lm-head
- gpt-j-6b lm-head
- LaMini-GPT-124M conv1d tied weight...
There is no need to use an FP32 scale for packing with the AutoGPTQ Triton backend; we can set FP16 as the default scale dtype instead. Nonetheless, it's essential to validate accuracy...
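As a quick way to gauge the impact before a full accuracy run, one could compare dequantized weights using FP32 vs FP16 scales; this is an illustrative check only, not the packing code's actual API:

```python
import torch

# Compare dequantization error when group scales are kept in FP32
# vs cast to FP16 before packing (random data, illustrative only).
scale_fp32 = torch.rand(128, 1, dtype=torch.float32) * 0.01
qweight = torch.randint(0, 16, (128, 256))

deq_fp32 = (qweight - 8) * scale_fp32
deq_fp16 = (qweight - 8) * scale_fp32.to(torch.float16).float()
print("max abs diff:", (deq_fp32 - deq_fp16).abs().max().item())
```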
https://huggingface.co/databricks/dbrx-instruct/blob/main/modeling_dbrx.py A simple but engineering-wise ugly solution is to follow https://huggingface.co/databricks/dbrx-instruct/discussions/10 and change the matmul to linear; let's follow this approach to add a patch for this model.
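The idea, roughly, is to replace raw torch.matmul calls on weight Parameters with nn.Linear modules so that quantization code (which hooks nn.Linear) can see them. A generic sketch of that transformation, not the actual DBRX patch:

```python
import torch
from torch import nn


class MatmulBlock(nn.Module):
    """Stand-in for a module that keeps its weight as a raw Parameter."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.w = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        return torch.matmul(x, self.w.t())


def to_linear(block: MatmulBlock) -> nn.Linear:
    """Wrap the raw weight into nn.Linear so quantizers can hook the layer."""
    out_features, in_features = block.w.shape
    linear = nn.Linear(in_features, out_features, bias=False)
    linear.weight = block.w  # reuse the same Parameter, no copy
    return linear


block = MatmulBlock(16, 32)
linear = to_linear(block)
x = torch.randn(4, 16)
assert torch.allclose(block(x), linear(x), atol=1e-6)
```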
Hook transformers' AutoHfQuantizer to support different backends and mixed-precision quantization.
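A rough sketch of what such a hook could look like: wrap AutoHfQuantizer.from_config so that configs carrying a custom backend field are routed to our own quantizer. Only AutoHfQuantizer.from_config itself is the real transformers entry point; the `backend` field and `MyQuantizer` below are hypothetical placeholders:

```python
from transformers.quantizers import AutoHfQuantizer


class MyQuantizer:
    """Placeholder: a real implementation should subclass transformers.quantizers.HfQuantizer."""

    def __init__(self, quantization_config, **kwargs):
        self.quantization_config = quantization_config


_orig_from_config = AutoHfQuantizer.from_config


def patched_from_config(quantization_config, **kwargs):
    """Route configs that request a custom backend to our own quantizer."""
    backend = getattr(quantization_config, "backend", None)  # hypothetical field
    if backend in ("gptq", "awq", "itrex"):
        return MyQuantizer(quantization_config, **kwargs)
    return _orig_from_config(quantization_config, **kwargs)


AutoHfQuantizer.from_config = staticmethod(patched_from_config)
```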
Feature request:
1. Support different kernels in different backends, including gptq/awq/itrex.
2. Support different bits and group_size for different layers, as sketched below.
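For the second item, a plain per-layer configuration could look like the following; the dict layout and key names are illustrative, not a finalized API:

```python
# Per-layer overrides: keys are module names, values override the global
# quantization settings for that layer.
global_config = {"bits": 4, "group_size": 128, "sym": False}

layer_config = {
    "lm_head": {"bits": 8, "group_size": 32},        # keep the head more precise
    "model.layers.0.self_attn.v_proj": {"bits": 8},  # sensitive first block
}


def config_for(layer_name: str) -> dict:
    """Merge the global config with any per-layer override."""
    merged = dict(global_config)
    merged.update(layer_config.get(layer_name, {}))
    return merged


print(config_for("lm_head"))
print(config_for("model.layers.3.mlp.down_proj"))
```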
Waiting for the fix: https://github.com/AutoGPTQ/AutoGPTQ/pull/640
For calibration with lm-head quantization or during tuning.