TensorRT-LLM
INT4 AWQ quantization fails for Llama 2 7B & 13B with higher tensor parallel degrees
System Info
- TensorRT-LLM v0.9.0
- Nvidia A10G
Who can help?
@Tracin
Information
- [X] The official example scripts
- [ ] My own modified scripts
Reproduction
Run quantize.py with --qformat int4_awq and --tp_size 4 or 8 for Llama 2 7B, or --tp_size 8 for Llama 2 13B, e.g.
python ../quantization/quantize.py --model_dir /llama-2-7b-hf \
--dtype float16 \
--qformat int4_awq \
--output_dir ./quantized_int4-awq \
--tp_size 4
Expected behavior
Quantization is successful
Actual behavior
Llama 2 7B
tp=4
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
for model_config in torch_to_model_config(
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 240, in torch_to_model_config
pack_linear_weights(model_config)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_utils.py", line 283, in pack_linear_weights
linear_layer.weight = to_quantized_weight(
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_utils.py", line 233, in to_quantized_weight
(weight / weights_scaling_factor[:, torch.arange(in_dim) // block_size])
IndexError: index 22 is out of bounds for dimension 0 with size 22
tp=8
...
IndexError: index 11 is out of bounds for dimension 0 with size 11
Llama 2 13B
tp=8
...
IndexError: index 14 is out of bounds for dimension 0 with size 14
Additional notes
Llama 3 8B quantizes without error under the same conditions (int4_awq, tp = 8).
Llama 2 7B with tp_size 4 does not satisfy the INT4 AWQ constraint when awq_block_size is 128: each tensor-parallel weight shard must divide evenly into quantization blocks. You can pass --awq_block_size 64 when quantizing the checkpoint. The other failing configurations hit the same issue. We might not be able to run 7B with TP8 due to this limitation.
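A minimal sketch of the constraint described above (this helper is illustrative, not from the TensorRT-LLM codebase): the per-rank input dimension of the sharded MLP weights, intermediate_size / tp_size, must be a multiple of awq_block_size, which is consistent with the IndexError raised when the scaling-factor tensor has one fewer block than the index computed from torch.arange(in_dim) // block_size.

```python
def shard_fits_block_size(intermediate_size: int, tp_size: int, block_size: int) -> bool:
    """True if each tensor-parallel weight shard divides evenly into AWQ blocks.

    Illustrative helper, assuming intermediate_size is divisible by tp_size,
    as it is for the Llama configurations below.
    """
    per_rank_dim = intermediate_size // tp_size
    return per_rank_dim % block_size == 0


# Intermediate (FFN) sizes from the public model configs:
#   Llama 2 7B: 11008, Llama 2 13B: 13824, Llama 3 8B: 14336
print(shard_fits_block_size(11008, 4, 128))  # Llama 2 7B, tp=4, default block size -> False (fails)
print(shard_fits_block_size(11008, 4, 64))   # with --awq_block_size 64            -> True
print(shard_fits_block_size(13824, 8, 64))   # Llama 2 13B, tp=8, block size 64    -> True
print(shard_fits_block_size(11008, 8, 64))   # Llama 2 7B, tp=8                    -> False (still fails)
print(shard_fits_block_size(14336, 8, 128))  # Llama 3 8B, tp=8, default block size -> True
```

This also explains why Llama 3 8B works at tp=8 with the default settings: 14336 / 8 = 1792, which is an exact multiple of 128, while 11008 / 4 = 2752 is not.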