
Accuracy drop for Llama

Open fmo-mt opened this issue 2 years ago • 10 comments

I tried to quantize a Llama model (Llama-13B) with SmoothQuant and found that if I only quantize the LlamaDecoderLayer, accuracy does not drop even when directly quantizing weights and activations; however, accuracy drops a lot when quantizing LlamaMLP, which contains 3 Linear layers and 1 activation layer:

  • model: decapoda-research/llama-13b-hf
  • dataset: wikitext-2-raw-v1
  • split: validation[:1000]
  • fp16 accuracy: 0.545
  • quantized accuracy (w/o quantized MLP): 0.446
  • smooth quant accuracy (w/o quantized MLP): 0.481
  • quantized accuracy (w quantized MLP): 0.026
  • smooth quant accuracy (w quantized MLP): 0.067
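For anyone trying to reproduce this setup, below is a minimal sketch of how one might fake-quantize only the attention projections of a Hugging Face Llama model while leaving LlamaMLP (gate_proj/up_proj/down_proj) in fp16. The simple per-tensor W8A8 wrapper is an illustration, not the repository's own implementation:

```python
import torch
from torch import nn

def fake_quant_per_tensor(x, n_bits=8):
    # Symmetric per-tensor fake quantization: map to the int8 grid, then dequantize.
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

class FakeQuantLinear(nn.Module):
    """Wraps an nn.Linear and fake-quantizes both its weight and its input activations."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.weight = nn.Parameter(fake_quant_per_tensor(linear.weight.data), requires_grad=False)
        self.bias = linear.bias

    def forward(self, x):
        return nn.functional.linear(fake_quant_per_tensor(x), self.weight, self.bias)

def quantize_attention_only(model):
    # Replace only q/k/v/o projections; skip gate_proj/up_proj/down_proj so LlamaMLP stays fp16.
    attn_names = {"q_proj", "k_proj", "v_proj", "o_proj"}
    for module in list(model.modules()):
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear) and child_name in attn_names:
                setattr(module, child_name, FakeQuantLinear(child))
    return model
```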

fmo-mt commented on Jun 08 '23

Additionally, I found that for OPT models, naive quantization does not cause an accuracy drop if we don't quantize the fc1 and fc2 layers, which means quantizing the self-attention layers is fine for these models; the most important part of quantization is how we quantize the FC layers.

fmo-mt commented on Jun 09 '23

Can you use SmoothQuant to quantize Llama without an accuracy drop? I tried to quantize Llama-7B, but the accuracy also drops a lot. @fmo-mt

[image]

MeJerry215 commented on Jun 21 '23

@Guangxuan-Xiao

MeJerry215 commented on Jun 21 '23

Can you use SmoothQuant to quantize Llama without an accuracy drop? I tried to quantize Llama-7B, but the accuracy also drops a lot. @fmo-mt

[image]

As I mentioned above, the accuracy drop mostly comes from decoder.mlp, and I have not figured out the proper way to quantize this layer; you may want to check that.

fmo-mt commented on Jun 22 '23

Can you use SmoothQuant to quantize Llama without an accuracy drop? I tried to quantize Llama-7B, but the accuracy also drops a lot. @fmo-mt [image]

As I mentioned above, the accuracy drop mostly comes from decoder.mlp, and I have not figured out the proper way to quantize this layer; you may want to check that.

When I disable all down_proj quantization, the accuracy recovers. Some down_proj activations range from 0.5 to 1800; even with smoothing, the activations are still large, which makes most of the weight values quantize to 0 or 1. Still, as a per-tensor quantization scheme, it does help a lot.
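To check this on your own model, here is a rough sketch of collecting the absolute max of the inputs reaching each down_proj layer with forward hooks. The module-name suffix and a dataloader yielding input_ids are assumptions based on the Hugging Face LlamaForCausalLM layout:

```python
import torch

@torch.no_grad()
def collect_down_proj_input_ranges(model, dataloader, num_batches=8):
    """Record the absolute max of inputs fed to every mlp.down_proj layer."""
    stats, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            amax = inputs[0].detach().abs().max().item()
            stats[name] = max(stats.get(name, 0.0), amax)
        return hook

    for name, module in model.named_modules():
        if name.endswith("mlp.down_proj"):
            handles.append(module.register_forward_hook(make_hook(name)))

    for i, batch in enumerate(dataloader):
        if i >= num_batches:
            break
        model(batch["input_ids"].to(model.device))

    for h in handles:
        h.remove()
    return stats  # e.g. {"model.layers.0.mlp.down_proj": <abs max seen>, ...}
```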

The paper says to use per-token quantization rather than per-tensor, so how do you do per-token quantization? Does anyone know?
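Not the official answer, but per-token quantization usually means computing one scale per token row of the activation (its last-dimension slice) instead of a single scale for the whole tensor; a minimal fake-quant sketch:

```python
import torch

def quantize_activation_per_token(x, n_bits=8):
    # x: (batch, seq_len, hidden_dim); one symmetric scale per token.
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    x_q = (x / scale).round().clamp(-qmax - 1, qmax)
    return x_q * scale  # dequantized here; a real int8 kernel would keep x_q and scale
```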

MeJerry215 commented on Jun 26 '23

Additionally, I found that for OPT models, naive quantization does not cause an accuracy drop if we don't quantize the fc1 and fc2 layers, which means quantizing the self-attention layers is fine for these models; the most important part of quantization is how we quantize the FC layers.

Do you quantize the matmuls in attention, or the RoPE in attention?

yokings commented on Jul 16 '23

Can you use SmoothQuant to quantize Llama without an accuracy drop? I tried to quantize Llama-7B, but the accuracy also drops a lot. @fmo-mt [image]

As I mentioned above, the accuracy drop mostly comes from decoder.mlp, and I have not figured out the proper way to quantize this layer; you may want to check that.

Do you quantize the matmuls in attention? I quantized the matmuls in attention without quantizing LlamaMLP, but accuracy still dropped a lot. Is there anything wrong?
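For what it's worth, "quantizing the matmuls in attention" usually refers to the two batched matmuls (Q·Kᵀ and attn_probs·V) rather than the linear projections. A rough per-tensor fake-quant sketch of that step (not this repo's code):

```python
import torch

def fake_quant(x, n_bits=8):
    # Symmetric per-tensor fake quantization of one matmul operand.
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def quantized_attention(q, k, v, scaling):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = torch.matmul(fake_quant(q), fake_quant(k).transpose(-1, -2)) * scaling
    probs = scores.softmax(dim=-1)
    # Second matmul: attention probabilities times values, also with fake-quantized inputs.
    return torch.matmul(fake_quant(probs), fake_quant(v))
```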

rolex-cjj commented on Aug 22 '23

I tried to quantize a Llama model (Llama-13B) with SmoothQuant and found that if I only quantize the LlamaDecoderLayer, accuracy does not drop even when directly quantizing weights and activations; however, accuracy drops a lot when quantizing LlamaMLP, which contains 3 Linear layers and 1 activation layer:

  • model: decapoda-research/llama-13b-hf
  • dataset: wikitext-2-raw-v1
  • split: validation[:1000]
  • fp16 accuracy: 0.545
  • quantized accuracy (w/o quantized MLP): 0.446
  • smooth quant accuracy (w/o quantized MLP): 0.481
  • quantized accuracy (w quantized MLP): 0.026
  • smooth quant accuracy (w quantized MLP): 0.067

Why is it that when I write llama.py to quantize the Llama model, using opt.py as a reference, the accuracy of the model I get is 0?

teacherguan commented on Aug 06 '24

You need to quantize activations per-token and weights per-channel.
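A minimal sketch of that combination (one scale per output channel for weights, one scale per token for activations); this is a generic illustration rather than the exact code in this repo:

```python
import torch

def quantize_weight_per_channel(w, n_bits=8):
    # w: (out_features, in_features); one symmetric scale per output channel (row).
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def quantize_activation_per_token(x, n_bits=8):
    # x: (..., seq_len, hidden_dim); one symmetric scale per token.
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def w8a8_linear(x, weight, bias=None):
    # Fake W8A8 linear: per-channel weight scales, per-token activation scales.
    return torch.nn.functional.linear(
        quantize_activation_per_token(x), quantize_weight_per_channel(weight), bias
    )
```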

msz12345 commented on Aug 12 '24