Llama-2-7b - Tensor Dimension Mismatch Error for AWQ Engine Build for 4 GPUs
Model under test: Llama-2-7b-chat-hf
Following the instructions here, I was able to quantize the model and build the engine for the single-GPU scenario, but a tensor dimension mismatch error occurred when building for 4 GPUs with tensor parallelism (TP).
Command: python examples/llama/build.py --model_dir ./Llama-2-7b-chat-hf --quant_ckpt_path ./Llama-2-7b-chat-hf_awq/llama-7b-4bit-gs128-awq.pt --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --world_size 4 --tp_size 4 --output_dir ./examples/llama/out/7b/awq_4gpu/
Error:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /code/tensorrt_llm/examples/llama/build.py:718 in <module> │
│ │
│ 715 │ else: │
│ 716 │ │ args.parallel_build = False │
│ 717 │ │ logger.info('Serially build TensorRT engines.') │
│ ❱ 718 │ │ build(0, args) │
│ 719 │ │
│ 720 │ tok = time.time() │
│ 721 │ t = time.strftime('%H:%M:%S', time.gmtime(tok - tik)) │
│ │
│ /code/tensorrt_llm/examples/llama/build.py:689 in build │
│ │
│ 686 │ │ │ opt_level=args.builder_opt) │
│ 687 │ │ engine_name = get_engine_name(MODEL_NAME, args.dtype, args.tp_size, │
│ 688 │ │ │ │ │ │ │ │ │ args.pp_size, cur_rank) │
│ ❱ 689 │ │ engine = build_rank_engine(builder, builder_config, engine_name, │
│ 690 │ │ │ │ │ │ │ │ cur_rank, args) │
│ 691 │ │ assert engine is not None, f'Failed to build engine for rank {cur_rank}' │
│ 692 │ │
│ │
│ /code/tensorrt_llm/examples/llama/build.py:543 in build_rank_engine │
│ │
│ 540 │ │ │ │ │ │ │ │ │ │ quant_scales=quant_scales) │
│ 541 │ if args.per_group: │
│ 542 │ │ load_func = load_from_awq_llama if args.weight_only_precision == 'int4_awq' else │
│ ❱ 543 │ │ load_func(tensorrt_llm_llama=tensorrt_llm_llama, │
│ 544 │ │ │ │ quant_ckpt_path=args.quant_ckpt_path, │
│ 545 │ │ │ │ mapping=mapping, │
│ 546 │ │ │ │ dtype=args.dtype) │
│ │
│ /code/tensorrt_llm/examples/llama/weight.py:1237 in load_from_awq_llama │
│ │
│ 1234 │ │ # MLP down_proj (mlp.proj) Linear │
│ 1235 │ │ mPrefix = prefix + "mlp.down_proj" │
│ 1236 │ │ mOp = tensorrt_llm_llama.layers[layer_idx].mlp.proj │
│ ❱ 1237 │ │ process_and_assign_weight(awq_llama, mPrefix, mOp, 0) │
│ 1238 │ │
│ 1239 │ │ # MLP gate_proj (mlp.fc) Linear │
│ 1240 │ │ mPrefix = prefix + "mlp.gate_proj" │
│ │
│ /code/tensorrt_llm/examples/llama/weight.py:1108 in process_and_assign_weight │
│ │
│ 1105 │ │ │ pre_quant_scale = pre_quant_scale.split(k // mapping.tp_size, │
│ 1106 │ │ │ │ │ │ │ │ │ │ │ │ │ dim=1)[mapping.tp_rank] │
│ 1107 │ │ scale = amax / 8.0 │
│ ❱ 1108 │ │ mOp.qweight.value = AWQ_quantize_pack_preprocess(weight, scale) │
│ 1109 │ │ mOp.scale.value = scale.to(torch_dtype).cpu().numpy() │
│ 1110 │ │ mOp.pre_quant_scale.value = pre_quant_scale.to( │
│ 1111 │ │ │ torch_dtype).cpu().numpy() │
│ │
│ /code/tensorrt_llm/examples/llama/weight.py:1087 in AWQ_quantize_pack_preprocess │
│ │
│ 1084 │ │
│ 1085 │ def AWQ_quantize_pack_preprocess(weight, scale): │
│ 1086 │ │ scale = scale.repeat_interleave(group_size, dim=0) │
│ ❱ 1087 │ │ weight = weight / scale │
│ 1088 │ │ qweight_int8 = torch.clamp(torch.round(weight.cuda()).char(), -8, 7) │
│ 1089 │ │ int4_weight = packer(qweight_int8.cpu()) │
│ 1090 │ │ int4_weight = preprocessor(int4_weight, torch.quint4x2) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: The size of tensor a (2752) must match the size of tensor b (2688) at non-singleton dimension 0
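For what it's worth, the mismatch numbers line up with mlp.down_proj being split along a dimension whose group count is not divisible by 4. Below is only a sketch of the arithmetic, assuming Llama-2-7B's intermediate size of 11008 and the group size of 128 implied by the gs128 checkpoint name:

```python
# Sketch only: reproduces the 2752 vs 2688 mismatch for mlp.down_proj,
# assuming k (in_features) = 11008 and group_size = 128.
k, group_size, tp_size = 11008, 128, 4

weight_rows_per_rank = k // tp_size                 # 2752 rows of the weight after the TP split
groups = k // group_size                            # 86 quantization groups along k
groups_per_rank = groups // tp_size                 # 21, because 86 is not divisible by 4
scale_rows_per_rank = groups_per_rank * group_size  # 2688 rows after repeat_interleave(group_size)

print(weight_rows_per_rank, scale_rows_per_rank)    # 2752 2688 -> the shapes that collide in weight / scale
```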
Can you test if the issue persists in the main branch, please? I see fixes for Llama AWQ TP > 1 in our internal repo that were pushed to the main branch but are not in release/0.5.0.
Do I need to rebuild the docker image for the main branch? I noticed that some files were updated under the docker folder.
From the notes, it seems the main branch and the 0.5 release each have updates that are not synced with the other.
The main branch is [20 commits ahead of](https://github.com/NVIDIA/TensorRT-LLM/compare/release/0.5.0...main) and [23 commits behind](https://github.com/NVIDIA/TensorRT-LLM/compare/main...release/0.5.0) release/0.5.0.
When will the next official version be available?
I reused the Docker image built for the 0.5.0 branch:

REPOSITORY             TAG           IMAGE ID       CREATED         SIZE
tensorrt_llm/release   latest-root   5e43c4749c11   41 hours ago    26.8GB

and started a new container with the following command:
docker run --name tensorrt-llm-main_test --privileged -idt --net=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all -v /mnt/tao-new/TensorRT-LLM-main:/code/tensorrt_llm 5e43c4749c11 bash
But I got a tensorrt_llm module-not-found error when I tried to build the engine:
python examples/llama/build.py --model_dir ./Llama-2-7b-chat-hf --quant_ckpt_path ./Llama-2-7b-chat-hf_awq/llama-7b-4bit-gs128-awq.pt --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --world_size 4 --tp_size 4 --output_dir ./examples/llama/out/7b/awq_4gpu/
ModuleNotFoundError: No module named 'tensorrt_llm.runtime.lora_manager'
Shouldn't it be installed inside the docker image already?
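One way to narrow this down (just a diagnostic sketch; the paths are whatever the docker run above produces) is to check which tensorrt_llm installation the interpreter actually picks up and whether it contains the module the main-branch build.py expects:

```python
# Diagnostic sketch: is Python importing the wheel baked into the 0.5.0 image
# or the mounted main-branch source tree, and does it have runtime.lora_manager?
import importlib.util

import tensorrt_llm

print(tensorrt_llm.__version__)  # version string of whichever install wins
print(tensorrt_llm.__file__)     # site-packages path -> installed wheel; /code/... -> mounted source
print(importlib.util.find_spec("tensorrt_llm.runtime.lora_manager"))  # None -> module not present
```

If the path points at the wheel from the 0.5.0 image, that wheel predates the main-branch example scripts, so the package would need to be rebuilt and installed for the main branch.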
I tried the latest main branch code and rebuilt the Docker image. Building with TP size = 4 still gave me the following error:
[12/12/2023-00:21:08] [TRT-LLM] [E] Current weight shape is invalid for mapping.tp_size=4
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /code/tensorrt_llm/examples/llama/build.py:839 in <module> │
│ │
│ 836 │ else: │
│ 837 │ │ args.parallel_build = False │
│ 838 │ │ logger.info('Serially build TensorRT engines.') │
│ ❱ 839 │ │ build(0, args) │
│ 840 │ │
│ 841 │ tok = time.time() │
│ 842 │ t = time.strftime('%H:%M:%S', time.gmtime(tok - tik)) │
│ │
│ /code/tensorrt_llm/examples/llama/build.py:783 in build │
│ │
│ 780 │ │ ) │
│ 781 │ │ engine_name = get_engine_name(MODEL_NAME, args.dtype, args.tp_size, │
│ 782 │ │ │ │ │ │ │ │ │ args.pp_size, cur_rank) │
│ ❱ 783 │ │ engine = build_rank_engine(builder, builder_config, engine_name, │
│ 784 │ │ │ │ │ │ │ │ cur_rank, args) │
│ 785 │ │ assert engine is not None, f'Failed to build engine for rank {cur_rank}' │
│ 786 │
│ /code/tensorrt_llm/examples/llama/build.py:602 in build_rank_engine │
│ │
│ 599 │ │ │ │ │ │ │ │ │ │ **quantize_kwargs) │
│ 600 │ if args.per_group: │
│ 601 │ │ load_func = load_from_awq_llama if args.weight_only_precision == 'int4_awq' else │
│ ❱ 602 │ │ load_func(tensorrt_llm_llama=tensorrt_llm_llama, │
│ 603 │ │ │ │ quant_ckpt_path=args.quant_ckpt_path, │
│ 604 │ │ │ │ mapping=mapping, │
│ 605 │ │ │ │ dtype=args.dtype, │
│ │
│ /code/tensorrt_llm/examples/llama/weight.py:1465 in load_from_awq_llama │
│ │
│ 1462 │ │ │
│ 1463 │ │ # 4.4 mlp.proj │
│ 1464 │ │ v = [load(prefix + awq_key_list[7] + suf) for suf in awq_suffix_list] │
│ ❱ 1465 │ │ process_and_assign_weight(layer.mlp.proj, v, 0) │
│ 1466 │ │ │
│ 1467 │ │ # 4.5 mlp.fc │
│ 1468 │ │ v = [load(prefix + awq_key_list[8] + suf) for suf in awq_suffix_list] │
│ │
│ /code/tensorrt_llm/examples/llama/weight.py:1350 in process_and_assign_weight │
│ │
│ 1347 │ │ [k, n] = weight.shape │
│ 1348 │ │ weight = torch_split(weight, tp_dim) │
│ 1349 │ │ amax = v[1].reshape((n, k // group_size)).T.contiguous() │
│ ❱ 1350 │ │ amax = torch_split(amax, tp_dim) │
│ 1351 │ │ pre_quant_scale = v[2].reshape((1, k)) │
│ 1352 │ │ if tp_dim == 0: │
│ 1353 │ │ │ pre_quant_scale = torch_split(pre_quant_scale, 1) │
│ │
│ /code/tensorrt_llm/examples/llama/weight.py:1335 in torch_split │
│ │
│ 1332 │ │ │ tensorrt_llm.logger.error( │
│ 1333 │ │ │ │ "Current weight shape is invalid for mapping.tp_size=" + │
│ 1334 │ │ │ │ str(mapping.tp_size)) │
│ ❱ 1335 │ │ │ assert False, "Invalid TP size" │
│ 1336 │ │ return v.split(v.shape[dim] // mapping.tp_size, │
│ 1337 │ │ │ │ │ dim=dim)[mapping.tp_rank] │
│ 1338 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: Invalid TP size
The single-GPU build is still successful, just like before.
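The assertion looks like the same root cause as before, now caught explicitly: for mlp.proj the per-group amax tensor has 11008 / 128 = 86 rows along the split dimension, and 86 is not divisible by tp_size = 4. A minimal sketch of the divisibility check that torch_split appears to enforce (shapes assumed from Llama-2-7B and the gs128 checkpoint):

```python
# Sketch of the divisibility guard behind "Invalid TP size"
# (assumes mlp.proj amax of shape (k // group_size, n) = (86, 4096)).
import torch

amax = torch.empty(11008 // 128, 4096)  # (86, 4096)
tp_size, tp_dim = 4, 0

if amax.shape[tp_dim] % tp_size != 0:
    # 86 % 4 == 2, so this branch is taken; weight.py asserts here with "Invalid TP size"
    print("Invalid TP size for shape", tuple(amax.shape))
```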
In the 0.5 release, the generated quantization checkpoint was a .pt file. Why does this version generate a .npz file instead?
Could you try the latest version?
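In case it helps when comparing the two checkpoint formats, the .npz file can be inspected directly with numpy; the filename below is only a placeholder for whatever the quantization step actually wrote out:

```python
# Peek at the tensors stored in the .npz quantization checkpoint
# (placeholder filename; substitute the file the quantization step produced).
import numpy as np

ckpt = np.load("llama-7b-4bit-gs128-awq.npz")
for name in list(ckpt.files)[:10]:
    print(name, ckpt[name].shape, ckpt[name].dtype)
```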