Accuracy drop for Llama
I tried to quantize a LLaMA model (LLaMA-13B) with SmoothQuant and found that if I only quantize the LlamaDecoderLayer (without the MLP), the accuracy does not drop even when directly quantizing weights and activations, but it drops a lot once the LlamaMLP, which contains 3 Linear layers and 1 activation layer, is quantized too:
- model: decapoda-research/llama-13b-hf
- dataset: wikitext-2-raw-v1
- split: validation[:1000]
- fp16 accuracy: 0.545
- quantized accuracy (w/o quantized MLP): 0.446
- smooth quant accuracy (w/o quantized MLP): 0.481
- quantized accuracy (w quantized MLP): 0.026
- smooth quant accuracy (w quantized MLP): 0.067
Still, I found that for OPT models naive quantization does not cause an accuracy drop as long as we don't quantize the fc1 and fc2 layers, which means quantizing the self-attention layers is fine for these models; the most important part of quantization is how we quantize the fully-connected layers.
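For anyone trying to reproduce the smoothing part: the per-channel factor can be folded into the RMSNorm that feeds gate_proj/up_proj, analogous to how smooth_lm folds it into the layer norm before OPT's fc1. Below is a minimal sketch assuming the Hugging Face LlamaMLP layout and a calibrated act_scales tensor of per-channel max |activation| values at the MLP input (the helper name and calibration input are my assumptions, not the repo's API):

```python
import torch

@torch.no_grad()
def smooth_llama_mlp(rms_norm, gate_proj, up_proj, act_scales, alpha=0.5):
    """Fold SmoothQuant scaling factors into the RMSNorm feeding the Llama MLP.
    act_scales: per-channel max |activation| at the MLP input, shape [hidden_size]."""
    dtype, device = gate_proj.weight.dtype, gate_proj.weight.device
    act_scales = act_scales.to(device=device, dtype=dtype)

    # Per-input-channel weight magnitude over both linears that share this input.
    weight_scales = torch.cat(
        [gate_proj.weight.abs().max(dim=0, keepdim=True)[0],
         up_proj.weight.abs().max(dim=0, keepdim=True)[0]],
        dim=0,
    ).max(dim=0)[0].clamp(min=1e-5)

    # SmoothQuant rule: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
    scales = (act_scales.pow(alpha) / weight_scales.pow(1 - alpha)).clamp(min=1e-5)

    # X' = X / s is absorbed by the preceding RMSNorm weight; W' = W * s.
    rms_norm.weight.div_(scales)
    gate_proj.weight.mul_(scales)
    up_proj.weight.mul_(scales)
```

down_proj is the awkward one: its input is silu(gate_proj(x)) * up_proj(x), so there is no preceding norm to absorb the division; the factor would have to be folded into the output channels of up_proj instead, which may be part of why down_proj stays the problematic layer.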
Can you use SmoothQuant to quantize LLaMA without an accuracy drop? I tried to quantize LLaMA-7B, but the accuracy also drops a lot. @fmo-mt
@Guangxuan-Xiao
> Can you use SmoothQuant to quantize LLaMA without an accuracy drop? I tried to quantize LLaMA-7B, but the accuracy also drops a lot. @fmo-mt
As I mentioned above, the accuracy drop mostly comes from decoder.mlp, and I have not figured out the proper way to quantize this layer; you may want to look into that.
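If it helps, a quick way to confirm where the outliers sit is to hook the MLP linears and record the largest input magnitude each one sees over a few calibration texts. This is just a diagnostic sketch, not something from the repo:

```python
import torch

@torch.no_grad()
def collect_mlp_input_ranges(model, tokenizer, texts, device="cuda"):
    """Record max |input| per MLP Linear (gate_proj / up_proj / down_proj)
    to see which projection receives the large activation outliers.
    Assumes `model` is already on `device`."""
    stats, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            stats[name] = max(stats.get(name, 0.0),
                              inputs[0].detach().abs().max().item())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and ".mlp." in name:
            hooks.append(module.register_forward_hook(make_hook(name)))

    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        model(ids)

    for h in hooks:
        h.remove()
    return stats
```

Sorting the returned dict by value should make it obvious whether the down_proj inputs dominate.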
> Can you use SmoothQuant to quantize LLaMA without an accuracy drop? I tried to quantize LLaMA-7B, but the accuracy also drops a lot. @fmo-mt
>
> As I mentioned above, the accuracy drop mostly comes from decoder.mlp, and I have not figured out the proper way to quantize this layer; you may want to look into that.
When I disable all down_proj quantization, the accuracy recovers. Some down_proj input activations range from 0.5 to 1800; even after smoothing the activations are still large, and most weight values end up quantized to 0 or 1. Still, for per-tensor quantization, smoothing does help a lot.
The paper says to use per-token quantization, not per-tensor, so how do we do per-token quantization? Does anyone know?
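Per-token just means one scale per row of the flattened [tokens, hidden] activation matrix instead of one scale for the whole tensor, computed dynamically at runtime. A standalone fake-quant sketch (not necessarily the repo's exact helper):

```python
import torch

def fake_quantize_activation_per_token_absmax(x, n_bits=8):
    """One scale per token: reduce |x| over the hidden dimension only,
    so each row of the [batch, seq, hidden] tensor gets its own scale."""
    q_max = 2 ** (n_bits - 1) - 1                                  # 127 for int8
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / q_max
    return (x / scales).round().clamp(-q_max - 1, q_max) * scales
```

Per-tensor quantization would use x.abs().max() as a single scalar scale instead, which is exactly what breaks down when one token's activations reach ~1800 while others stay below 1.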
> Still, I found that for OPT models naive quantization does not cause an accuracy drop as long as we don't quantize the fc1 and fc2 layers, which means quantizing the self-attention layers is fine for these models; the most important part of quantization is how we quantize the fully-connected layers.
Do you quantize the matmuls in attention, or the RoPE in attention?
> Can you use SmoothQuant to quantize LLaMA without an accuracy drop? I tried to quantize LLaMA-7B, but the accuracy also drops a lot. @fmo-mt
>
> As I mentioned above, the accuracy drop mostly comes from decoder.mlp, and I have not figured out the proper way to quantize this layer; you may want to look into that.
Do you quantize the matmuls in attention? I quantized the attention matmuls without quantizing LlamaMLP, but the accuracy also dropped a lot. Is there anything wrong with that?
> I tried to quantize a LLaMA model (LLaMA-13B) with SmoothQuant and found that if I only quantize the LlamaDecoderLayer (without the MLP), the accuracy does not drop even when directly quantizing weights and activations, but it drops a lot once the LlamaMLP, which contains 3 Linear layers and 1 activation layer, is quantized too:
>
> - model: decapoda-research/llama-13b-hf
> - dataset: wikitext-2-raw-v1
> - split: validation[:1000]
> - fp16 accuracy: 0.545
> - quantized accuracy (w/o quantized MLP): 0.446
> - smooth quant accuracy (w/o quantized MLP): 0.481
> - quantized accuracy (w quantized MLP): 0.026
> - smooth quant accuracy (w quantized MLP): 0.067
Why is it that when I write a llama.py to quantize the LLaMA model by referring to opt.py, the accuracy of the model I get is 0?
You need to quantize activations per-token and weights per-channel.
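Concretely, that combination looks roughly like the following fake-quantized linear (a sketch under those assumptions; the function name is mine, not the repo's module):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def w8a8_linear(x, weight, bias=None, n_bits=8):
    """Fake W8A8 linear: per-channel (per output row) scales for the weight,
    per-token (per row of the flattened activations) scales for the input."""
    q_max = 2 ** (n_bits - 1) - 1

    # Per-channel weight quantization: one scale per output channel.
    w_scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / q_max
    w_q = (weight / w_scales).round().clamp(-q_max - 1, q_max) * w_scales

    # Per-token activation quantization: one scale per token.
    x_scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / q_max
    x_q = (x / x_scales).round().clamp(-q_max - 1, q_max) * x_scales

    return F.linear(x_q, w_q, bias)
```

With a single per-tensor scale instead, the extreme down_proj ranges described above are what push most quantized values to 0 or ±1; the finer granularity avoids that without changing the matmul itself.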
