
INT4 AWQ quantization fails for Llama 2 7B & 13B with higher tensor parallel degrees

Open ethnzhng opened this issue 1 year ago • 1 comment

System Info

  • TensorRT-LLM v0.9.0
  • Nvidia A10G

Who can help?

@Tracin

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Reproduction

Run quantize.py with --qformat int4_awq and --tp_size 4 or 8 for Llama 2 7B, or --tp_size 8 for Llama 2 13B

e.g.

python ../quantization/quantize.py --model_dir /llama-2-7b-hf \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --output_dir ./quantized_int4-awq \
                                   --tp_size 4

Expected behavior

Quantization is successful

actual behavior

Llama 2 7B

tp=4
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
    for model_config in torch_to_model_config(
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 240, in torch_to_model_config
    pack_linear_weights(model_config)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_utils.py", line 283, in pack_linear_weights
    linear_layer.weight = to_quantized_weight(
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_utils.py", line 233, in to_quantized_weight
    (weight / weights_scaling_factor[:, torch.arange(in_dim) // block_size])
IndexError: index 22 is out of bounds for dimension 0 with size 22
tp=8
...
IndexError: index 11 is out of bounds for dimension 0 with size 11

Llama 2 13B

tp=8
...
IndexError: index 14 is out of bounds for dimension 0 with size 14

additional notes

Llama 3 8B can be quantized without error under the same conditions (int4_awq & tp = 8).

ethnzhng avatar May 21 '24 04:05 ethnzhng

llama-2-7B with tp_size 4 does not satisfy the limitation of INT4 AWQ when awq_block_size is 128. You can set --awq_block_size 64 when quantizing the checkpoint. The other failing tests have similar issues. We might not be able to run 7B with TP8 due to this limitation.
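A minimal sketch of the constraint implied by this answer: after tensor-parallel sharding, each shard's input dimension must divide evenly into AWQ quantization blocks, otherwise the scaling-factor indexing overruns as in the tracebacks above. The divisibility rule is inferred from this thread (not an official ModelOpt check); the intermediate sizes are the models' published `config.json` values.

```python
# Sketch: check whether a model's MLP intermediate size splits evenly into
# AWQ blocks after tensor-parallel sharding. The rule (per-shard in_features
# must be a multiple of awq_block_size) is inferred from this issue thread.

def awq_shard_ok(in_features: int, tp_size: int, awq_block_size: int) -> bool:
    """Each TP shard's input dimension must hold a whole number of AWQ blocks."""
    if in_features % tp_size != 0:
        return False
    per_shard = in_features // tp_size
    return per_shard % awq_block_size == 0

# intermediate_size (FFN) from each model's config.json
models = {"Llama-2-7B": 11008, "Llama-2-13B": 13824, "Llama-3-8B": 14336}

for name, dim in models.items():
    for tp in (4, 8):
        for block in (128, 64):
            status = "ok" if awq_shard_ok(dim, tp, block) else "fails"
            print(f"{name:12s} tp={tp} awq_block_size={block}: {status}")
```

Under this rule, Llama-2-7B at tp=4 fails with block size 128 (11008 / 4 = 2752, not a multiple of 128) but passes with 64, while at tp=8 it fails for both block sizes (1376 is not a multiple of 64), which matches "might not be able to run 7B with TP8". Llama-3-8B (14336 / 8 = 1792) divides evenly by 128, matching the working case reported above.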

byshiue avatar May 23 '24 09:05 byshiue