TensorRT-LLM
INT4 AWQ quantization fails for Llama 2 7B & 13B with higher tensor parallel degrees
System Info
- TensorRT-LLM v0.9.0
- Nvidia A10G
Who can help?
@Tracin
Information
- [X] The official example scripts
- [ ] My own modified scripts
Reproduction
Run quantize.py with --qformat int4_awq and --tp_size 4 or 8 for Llama 2 7B, or --tp_size 8 for Llama 2 13B, e.g.
python ../quantization/quantize.py --model_dir /llama-2-7b-hf \
--dtype float16 \
--qformat int4_awq \
--output_dir ./quantized_int4-awq \
--tp_size 4
Expected behavior
Quantization is successful
Actual behavior
Llama 2 7B
tp=4
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
for model_config in torch_to_model_config(
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 240, in torch_to_model_config
pack_linear_weights(model_config)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_utils.py", line 283, in pack_linear_weights
linear_layer.weight = to_quantized_weight(
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_utils.py", line 233, in to_quantized_weight
(weight / weights_scaling_factor[:, torch.arange(in_dim) // block_size])
IndexError: index 22 is out of bounds for dimension 0 with size 22
tp=8
...
IndexError: index 11 is out of bounds for dimension 0 with size 11
Llama 2 13B
tp=8
...
IndexError: index 14 is out of bounds for dimension 0 with size 14
Additional notes
Llama 3 8B quantizes without error under the same conditions (int4_awq, tp = 8).
Llama 2 7B with tp_size 4 does not satisfy the INT4 AWQ constraint when awq_block_size is 128: each tensor-parallel weight shard must divide evenly into quantization blocks. You can pass --awq_block_size 64 when quantizing the checkpoint. The other failing configurations hit the same issue. We might not be able to run 7B with TP8 due to this limitation.
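A minimal sketch of the constraint described above (this helper is illustrative, not from the TensorRT-LLM codebase): the per-rank input dimension of the sharded MLP weights, intermediate_size / tp_size, must be a multiple of awq_block_size, which is consistent with the IndexError raised when the scaling-factor tensor has one fewer block than the index computed from torch.arange(in_dim) // block_size.

```python
def shard_fits_block_size(intermediate_size: int, tp_size: int, block_size: int) -> bool:
    """True if each tensor-parallel weight shard divides evenly into AWQ blocks.

    Illustrative helper, assuming intermediate_size is divisible by tp_size,
    as it is for the Llama configurations below.
    """
    per_rank_dim = intermediate_size // tp_size
    return per_rank_dim % block_size == 0


# Intermediate (FFN) sizes from the public model configs:
#   Llama 2 7B: 11008, Llama 2 13B: 13824, Llama 3 8B: 14336
print(shard_fits_block_size(11008, 4, 128))  # Llama 2 7B, tp=4, default block size -> False (fails)
print(shard_fits_block_size(11008, 4, 64))   # with --awq_block_size 64            -> True
print(shard_fits_block_size(13824, 8, 64))   # Llama 2 13B, tp=8, block size 64    -> True
print(shard_fits_block_size(11008, 8, 64))   # Llama 2 7B, tp=8                    -> False (still fails)
print(shard_fits_block_size(14336, 8, 128))  # Llama 3 8B, tp=8, default block size -> True
```

This also explains why Llama 3 8B works at tp=8 with the default settings: 14336 / 8 = 1792, which is an exact multiple of 128, while 11008 / 4 = 2752 is not.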