Accuracy drop for Llama
I tried to quantize a LLaMA model (LLaMA-13B) with SmoothQuant and found that if I only quantize the LlamaDecoderLayer (without the MLP), the accuracy does not drop even when directly quantizing weights and activations, but it drops a lot once the LlamaMLP, which contains 3 Linear layers and 1 activation layer, is quantized too:
- model: decapoda-research/llama-13b-hf
- dataset: wikitext-2-raw-v1
- split: validation[:1000]
- fp16 accuracy: 0.545
- quantized accuracy (w/o quantized MLP): 0.446
- smooth quant accuracy (w/o quantized MLP): 0.481
- quantized accuracy (w quantized MLP): 0.026
- smooth quant accuracy (w quantized MLP): 0.067
Still, I found that for OPT models naive quantization does not cause an accuracy drop as long as we don't quantize the fc1 and fc2 layers, which means quantizing the self-attention layers is fine for these models; the most important part of quantization is how we quantize the fully-connected layers.
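For anyone trying to reproduce the smoothing part: the per-channel factor can be folded into the RMSNorm that feeds gate_proj/up_proj, analogous to how smooth_lm folds it into the layer norm before OPT's fc1. Below is a minimal sketch assuming the Hugging Face LlamaMLP layout and a calibrated act_scales tensor of per-channel max |activation| values at the MLP input (the helper name and calibration input are my assumptions, not the repo's API):

```python
import torch

@torch.no_grad()
def smooth_llama_mlp(rms_norm, gate_proj, up_proj, act_scales, alpha=0.5):
    """Fold SmoothQuant scaling factors into the RMSNorm feeding the Llama MLP.
    act_scales: per-channel max |activation| at the MLP input, shape [hidden_size]."""
    dtype, device = gate_proj.weight.dtype, gate_proj.weight.device
    act_scales = act_scales.to(device=device, dtype=dtype)

    # Per-input-channel weight magnitude over both linears that share this input.
    weight_scales = torch.cat(
        [gate_proj.weight.abs().max(dim=0, keepdim=True)[0],
         up_proj.weight.abs().max(dim=0, keepdim=True)[0]],
        dim=0,
    ).max(dim=0)[0].clamp(min=1e-5)

    # SmoothQuant rule: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
    scales = (act_scales.pow(alpha) / weight_scales.pow(1 - alpha)).clamp(min=1e-5)

    # X' = X / s is absorbed by the preceding RMSNorm weight; W' = W * s.
    rms_norm.weight.div_(scales)
    gate_proj.weight.mul_(scales)
    up_proj.weight.mul_(scales)
```

down_proj is the awkward one: its input is silu(gate_proj(x)) * up_proj(x), so there is no preceding norm to absorb the division; the factor would have to be folded into the output channels of up_proj instead, which may be part of why down_proj stays the problematic layer.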
Can you use SmoothQuant to quantize LLaMA without an accuracy drop? I tried to quantize LLaMA-7B, but the accuracy also drops a lot. @fmo-mt
@Guangxuan-Xiao
> Can you use SmoothQuant to quantize LLaMA without an accuracy drop? I tried to quantize LLaMA-7B, but the accuracy also drops a lot. @fmo-mt
As I mentioned above, the accuracy drop mostly comes from decoder.mlp, and I have not figured out the proper way to quantize this layer; you may want to look into that.
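If it helps, a quick way to confirm where the outliers sit is to hook the MLP linears and record the largest input magnitude each one sees over a few calibration texts. This is just a diagnostic sketch, not something from the repo:

```python
import torch

@torch.no_grad()
def collect_mlp_input_ranges(model, tokenizer, texts, device="cuda"):
    """Record max |input| per MLP Linear (gate_proj / up_proj / down_proj)
    to see which projection receives the large activation outliers.
    Assumes `model` is already on `device`."""
    stats, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            stats[name] = max(stats.get(name, 0.0),
                              inputs[0].detach().abs().max().item())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and ".mlp." in name:
            hooks.append(module.register_forward_hook(make_hook(name)))

    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        model(ids)

    for h in hooks:
        h.remove()
    return stats
```

Sorting the returned dict by value should make it obvious whether the down_proj inputs dominate.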
> Can you use SmoothQuant to quantize LLaMA without an accuracy drop? I tried to quantize LLaMA-7B, but the accuracy also drops a lot. @fmo-mt
>
> As I mentioned above, the accuracy drop mostly comes from decoder.mlp, and I have not figured out the proper way to quantize this layer; you may want to look into that.
When I disable all down_proj quantization, the accuracy recovers. Some down_proj input activations range from 0.5 to 1800; even after smoothing the activations are still large, and most weight values end up quantized to 0 or 1. Still, for per-tensor quantization, smoothing does help a lot.
The paper says to use per-token quantization, not per-tensor, so how do we do per-token quantization? Does anyone know?
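Per-token just means one scale per row of the flattened [tokens, hidden] activation matrix instead of one scale for the whole tensor, computed dynamically at runtime. A standalone fake-quant sketch (not necessarily the repo's exact helper):

```python
import torch

def fake_quantize_activation_per_token_absmax(x, n_bits=8):
    """One scale per token: reduce |x| over the hidden dimension only,
    so each row of the [batch, seq, hidden] tensor gets its own scale."""
    q_max = 2 ** (n_bits - 1) - 1                                  # 127 for int8
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / q_max
    return (x / scales).round().clamp(-q_max - 1, q_max) * scales
```

Per-tensor quantization would use x.abs().max() as a single scalar scale instead, which is exactly what breaks down when one token's activations reach ~1800 while others stay below 1.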
> Still, I found that for OPT models naive quantization does not cause an accuracy drop as long as we don't quantize the fc1 and fc2 layers, which means quantizing the self-attention layers is fine for these models; the most important part of quantization is how we quantize the fully-connected layers.
Do you quantize the matmuls in attention, or the RoPE in attention?
> Can you use SmoothQuant to quantize LLaMA without an accuracy drop? I tried to quantize LLaMA-7B, but the accuracy also drops a lot. @fmo-mt
>
> As I mentioned above, the accuracy drop mostly comes from decoder.mlp, and I have not figured out the proper way to quantize this layer; you may want to look into that.
Do you quantize the matmuls in attention? I quantized the attention matmuls without quantizing LlamaMLP, but the accuracy also dropped a lot. Is there anything wrong with that?
> I tried to quantize a LLaMA model (LLaMA-13B) with SmoothQuant and found that if I only quantize the LlamaDecoderLayer (without the MLP), the accuracy does not drop even when directly quantizing weights and activations, but it drops a lot once the LlamaMLP, which contains 3 Linear layers and 1 activation layer, is quantized too:
>
> - model: decapoda-research/llama-13b-hf
> - dataset: wikitext-2-raw-v1
> - split: validation[:1000]
> - fp16 accuracy: 0.545
> - quantized accuracy (w/o quantized MLP): 0.446
> - smooth quant accuracy (w/o quantized MLP): 0.481
> - quantized accuracy (w quantized MLP): 0.026
> - smooth quant accuracy (w quantized MLP): 0.067
Why is it that when I write a llama.py to quantize the LLaMA model by referring to opt.py, the accuracy of the model I get is 0?
You need to quantize activations per-token and weights per-channel.
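Concretely, that combination looks roughly like the following fake-quantized linear (a sketch under those assumptions; the function name is mine, not the repo's module):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def w8a8_linear(x, weight, bias=None, n_bits=8):
    """Fake W8A8 linear: per-channel (per output row) scales for the weight,
    per-token (per row of the flattened activations) scales for the input."""
    q_max = 2 ** (n_bits - 1) - 1

    # Per-channel weight quantization: one scale per output channel.
    w_scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / q_max
    w_q = (weight / w_scales).round().clamp(-q_max - 1, q_max) * w_scales

    # Per-token activation quantization: one scale per token.
    x_scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / q_max
    x_q = (x / x_scales).round().clamp(-q_max - 1, q_max) * x_scales

    return F.linear(x_q, w_q, bias)
```

With a single per-tensor scale instead, the extreme down_proj ranges described above are what push most quantized values to 0 or ±1; the finer granularity avoids that without changing the matmul itself.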
