AutoAWQ
about the shape of qzeros in awq quantization model
@casper-hansen Hi, I have a question about the AWQ-quantized model on Hugging Face: https://huggingface.co/TheBloke/Llama-2-7B-AWQ/tree/main?show_file_info=model.safetensors.
The shapes of qzeros and scales are as follows:

| tensor | shape | dtype |
| --- | --- | --- |
| model.layers.0.mlp.down_proj.qweight | [11008, 512] | I32 |
| model.layers.0.mlp.down_proj.qzeros | [86, 512] | I32 |
| model.layers.0.mlp.down_proj.scales | [86, 4096] | F16 |
Why is the second dimension of scales 4096 rather than 1? As I understand it, all weights in a quantization group share the same qzero and scale, yet I have checked and the 4096 values are not identical.
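For reference, the shapes above follow directly from the layer dimensions: down_proj has in_features=11008 and out_features=4096, groups run along the input dimension, and eight 4-bit values are packed into each int32 along the output dimension. A minimal sketch of that arithmetic (group_size=128 and 4-bit quantization are assumptions consistent with the listed shapes):

```python
# Assumed parameters for down_proj in Llama-2-7B with AWQ:
in_features, out_features = 11008, 4096
group_size = 128           # one (scale, zero) pair per 128 input channels
pack_factor = 32 // 4      # eight 4-bit values fit in one int32

# qweight: quantized weights, packed 8-per-int32 along the output dim
qweight_shape = (in_features, out_features // pack_factor)            # (11008, 512)
# qzeros: one zero-point per (group, output channel), also packed
qzeros_shape = (in_features // group_size, out_features // pack_factor)  # (86, 512)
# scales: one fp16 scale per (group, output channel), NOT packed
scales_shape = (in_features // group_size, out_features)              # (86, 4096)

print(qweight_shape, qzeros_shape, scales_shape)
```

So the second dimension of scales is 4096 because each *output channel* has its own scale within a group; a group is shared only along the input dimension, not across output channels.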
@MuYu-zhi please check out the GEMM linear module. All weights are packed in a special way that is tied to how the CUDA kernels execute.
Is there any detailed explanation of the pack operation, or a code fragment in this repo that implements it?
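As a rough illustration of what the GEMM path does, here is a minimal NumPy sketch of interleaved 4-bit packing: eight 4-bit values go into one int32, but in the interleaved order [0, 2, 4, 6, 1, 3, 5, 7] rather than sequentially, so the kernel can dequantize pairs efficiently. The function name and exact order here are assumptions for illustration, not AutoAWQ's actual API:

```python
import numpy as np

# Assumed interleave order used when packing 8 nibbles into an int32.
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack 4-bit values (0..15) along the last axis, 8 per int32.

    The value at logical index AWQ_ORDER[i] within each group of 8
    lands in bit positions [4*i, 4*i+4) of the packed int32.
    """
    assert values.shape[-1] % 8 == 0
    v = values.reshape(*values.shape[:-1], -1, 8).astype(np.int32)
    packed = np.zeros(v.shape[:-1], dtype=np.int32)
    for shift_pos, src in enumerate(AWQ_ORDER):
        packed |= v[..., src] << (4 * shift_pos)
    return packed

# Example: a single row of 8 nibbles packs into one int32.
row = np.arange(8).reshape(1, 8)
packed = pack_int4(row)
```

This is why qzeros reads [86, 512] instead of [86, 4096]: the int32 storage holds eight 4-bit zero-points per element, in the interleaved layout the kernel expects. For the exact implementation, see the packing code in AutoAWQ's quantized linear module.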