TensorRT-LLM
CodeLlama-7B int4-awq gets the error "The value updated is not the same shape as the original."
System Info
CPU x86_64
GPU NVIDIA A10
TensorRT-LLM branch: main, commit id: cad22332550eef9be579e767beb7d605dd96d6f3
CUDA: NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4
Who can help?
Quantization: @Tracin
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I use this command for int4-awq quantization of CodeLlama-7B:
python tensorrt_llm/examples/quantization/quantize.py --model_dir /data/META-CodeLlama-7b-hf/ \
--dtype float16 \
--qformat int4_awq \
--export_path /data/trt_llama_7b_quantized_int4-awq \
--calib_size 32
And I get the following files:
trt_llama_7b_quantized_int4-awq/
├── llama_tp1.json
└── llama_tp1_rank0.npz
Then I try to build the int4_AWQ engines:
python build.py --model_dir /data/META-CodeLlama-7b-hf/ \
    --quant_ckpt_path /data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --paged_kv_cache \
    --use_inflight_batching \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --vocab_size 32064 \
    --rotary_base 1000000 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group \
    --output_dir /data/llama_7B_trt_engines_int4_AWQ/1-gpu/
Expected behavior
Engine files should be generated, something like this:
-rw-r--r-- 1 root root 2191 Jan 25 07:39 config.json
-rw-r--r-- 1 root root 6876290868 Jan 25 07:39 llama_float16_tp2_rank0.engine
-rw-r--r-- 1 root root 6876290868 Jan 25 07:39 llama_float16_tp2_rank1.engine
-rw-r--r-- 1 root root 163591 Jan 25 07:39 model.cache
Actual behavior
I get this error:
AssertionError: The value updated is not the same shape as the original. Updated: (32064, 4096), original: (32016, 4096)
The full log is:
root@7728b95eabcd:/data/tensorrt_llm/examples/llama# python build.py --model_dir /data/META-CodeLlama-7b-hf/ \
    --quant_ckpt_path /data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --paged_kv_cache \
    --use_inflight_batching \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --vocab_size 32064 \
    --rotary_base 1000000 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group \
    --output_dir /data/llama_7B_trt_engines_int4_AWQ/1-gpu/
[01/25/2024-01:36:11] [TRT-LLM] [I] Serially build TensorRT engines.
[01/25/2024-01:36:11] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 133, GPU 228 (MiB)
[01/25/2024-01:36:13] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2068, GPU 540 (MiB)
[01/25/2024-01:36:13] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[01/25/2024-01:36:13] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 2.5261 (GiB) Device 0.0000 (GiB)
[01/25/2024-01:36:13] [TRT-LLM] [I] Loading weights from groupwise AWQ LLaMA checkpoint...
Traceback (most recent call last):
File "/data/tensorrt_llm/examples/llama/build.py", line 1051, in <module>
build(0, args)
File "/data/tensorrt_llm/examples/llama/build.py", line 995, in build
engine = build_rank_engine(builder, builder_config, engine_name,
File "/data/tensorrt_llm/examples/llama/build.py", line 873, in build_rank_engine
tensorrt_llm_llama = get_model_object(args,
File "/data/tensorrt_llm/examples/llama/build.py", line 745, in get_model_object
load_from_awq_llama(tensorrt_llm_llama=tensorrt_llm_llama,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1522, in load_from_awq_llama
tensorrt_llm_llama.vocab_embedding.weight.value = v.to(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 113, in value
assert v.shape == self._shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (32064, 4096), original: (32016, 4096)
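For reference, a one-liner to check the embedding shape actually stored in the exported checkpoint; this is a diagnostic sketch, with the tensor key _np:vocab_embedding:weight taken from the llama_tp1.json excerpt below:
python3 -c "import numpy as np; print(np.load('/data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz')['_np:vocab_embedding:weight'].shape)"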
Additional notes
In CodeLlama-7b-hf/config.json, the vocab_size is 32016:
{
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 16384,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 1000000,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.32.0.dev0",
"use_cache": true,
"vocab_size": 32016
}
But in trt_llama_7b_quantized_int4-awq/llama_tp1.json, the vocab_size is 32064:
{"version": 0.4, "quantization": "int4_awq", "awq_block_size": 128, "dtype": "float16", "vocab_size": 32064, "rank": 0, "tensor_parallel": 1, "vocab_embedding": {"weight": "_np:vocab_embedding:weight"}, "positional_embedding": null, "layers": [{"decoder_type": "llama", "input_layernorm": {"weight": "_np:layers:0:input_layernorm:weight", "bias": null, "layernorm_type": "rms", "eps": 1e-05}, "mlp_layernorm": null
The two vocab_size values differ, so how can I solve this problem?
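Incidentally, the gap between 32016 and 32064 looks like the exporter padding the vocabulary up to a multiple of 64. A minimal sketch of that arithmetic (the multiple of 64 is an assumption inferred from these two numbers, not something taken from the code):
# round vocab_size up to an assumed multiple of 64: 32016 -> 32064
vocab=32016; multiple=64
echo $(( (vocab + multiple - 1) / multiple * multiple ))   # prints 32064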
Thanks.
Please try to add the quantize_lm_head option to build.py.
@Tracin OK, thanks for your reply, I will just try it.
@Tracin I added the --quantize_lm_head parameter, but a new error appeared:
python build.py --model_dir /data/META-CodeLlama-7b-hf/ \
    --quant_ckpt_path /data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --quantize_lm_head \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group \
    --output_dir /data/llama_7B_trt_engines_int4_AWQ/1-gpu/
[01/29/2024-01:42:25] [TRT-LLM] [I] To use awq we pad vocab_size to 32064.
[01/29/2024-01:42:25] [TRT-LLM] [I] Serially build TensorRT engines.
[01/29/2024-01:42:25] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 133, GPU 228 (MiB)
[01/29/2024-01:42:27] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2068, GPU 540 (MiB)
[01/29/2024-01:42:27] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[01/29/2024-01:42:27] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 2.5260 (GiB) Device 0.0000 (GiB)
[01/29/2024-01:42:27] [TRT-LLM] [I] Loading weights from groupwise AWQ LLaMA checkpoint...
Traceback (most recent call last):
File "/data/tensorrt_llm/examples/llama/build.py", line 1051, in <module>
build(0, args)
File "/data/tensorrt_llm/examples/llama/build.py", line 995, in build
engine = build_rank_engine(builder, builder_config, engine_name,
File "/data/tensorrt_llm/examples/llama/build.py", line 873, in build_rank_engine
tensorrt_llm_llama = get_model_object(args,
File "/data/tensorrt_llm/examples/llama/build.py", line 745, in get_model_object
load_from_awq_llama(tensorrt_llm_llama=tensorrt_llm_llama,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1527, in load_from_awq_llama
v = [load(awq_key_list[1] + suf) for suf in awq_suffix_list]
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1527, in <listcomp>
v = [load(awq_key_list[1] + suf) for suf in awq_suffix_list]
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1403, in load
v = torch.from_numpy(awq_llama[awq_prefix + key])
File "/usr/local/lib/python3.10/dist-packages/numpy/lib/npyio.py", line 263, in __getitem__
raise KeyError(f"{key} is not a file in the archive")
KeyError: '_np:lm_head:weights_scaling_factor is not a file in the archive'
And I printed the values of awq_key_list and awq_suffix_list:
awq_key_list is: ['vocab_embedding:weight', 'lm_head', 'final_layernorm:weight', 'attention:qkv:', '', 'attention:dense', 'mlp:gate', 'mlp:proj', 'mlp:fc', 'input_layernorm:weight', 'post_layernorm:weight']
awq_suffix_list is: [':weight', ':weights_scaling_factor', ':prequant_scaling_factor']
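A quick way to check whether any lm_head tensors made it into the exported checkpoint at all; a diagnostic one-liner using numpy, with the path taken from the commands above:
python3 -c "import numpy as np; print([k for k in np.load('/data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz').files if 'lm_head' in k])"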
Sorry for the ambiguous instruction; you have to add --quantize_lm_head for quantize.py as well. Since there is a bug in AMMO, we have to enable this explicitly until the next release.
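That is, re-run the export with the flag added and then rebuild. A sketch based on the quantize.py command from the top of this issue; only --quantize_lm_head is new, and its placement is assumed:
# re-export with lm_head quantization enabled, then re-run build.py as before
python tensorrt_llm/examples/quantization/quantize.py --model_dir /data/META-CodeLlama-7b-hf/ \
    --dtype float16 \
    --qformat int4_awq \
    --quantize_lm_head \
    --export_path /data/trt_llama_7b_quantized_int4-awq \
    --calib_size 32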
Sorry for ambiguous instruction, you have to add
--quantize_lm_headalso forquantize.py. Since there is a bug in AMMO, we have to enable this before next release.
@Tracin OK, got it, I will try it. Thanks.
@Tracin It works, so nice.
But when I launch the Triton Server, I get this error:
E0129 06:39:43.236314 25545 model_repository_manager.cc:580] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:94)
1 0x7fec6c8bb6ed /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x176ed) [0x7fec6c8bb6ed]
2 0x7fec6c8d4814 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x30814) [0x7fec6c8d4814]
3 0x7fec6c919cc8 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x75cc8) [0x7fec6c919cc8]
4 0x7fec6c90f0a8 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6b0a8) [0x7fec6c90f0a8]
5 0x7fec6c8f03ee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4c3ee) [0x7fec6c8f03ee]
6 0x7fec6c8f14e2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4d4e2) [0x7fec6c8f14e2]
7 0x7fec6c8e1045 TRITONBACKEND_ModelInstanceInitialize + 101
8 0x7feccef89226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7feccef89226]
9 0x7feccef8a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7feccef8a466]
10 0x7feccef6d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7feccef6d165]
11 0x7feccef6d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7feccef6d7a6]
12 0x7feccef79a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7feccef79a1d]
13 0x7fecce5e4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fecce5e4ee8]
14 0x7feccef63feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7feccef63feb]
15 0x7feccef73dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7feccef73dc5]
16 0x7feccef78d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7feccef78d36]
17 0x7feccf069330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7feccf069330]
18 0x7feccf06ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7feccf06ca23]
19 0x7feccf1c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7feccf1c0d82]
20 0x7fecce84f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fecce84f253]
21 0x7fecce5dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fecce5dfac3]
22 0x7fecce670814 clone + 68;
I0129 06:39:43.236334 25545 model_lifecycle.cc:286] VersionStates() 'tensorrt_llm'
I0129 06:39:43.236375 25545 server.cc:606]
How can I solve it?
Assertion failed: mpiSize == tp * pp. Did you run with MPI?
@Tracin Yes, I use scripts/launch_triton_server.py to launch the Triton Server.
@byshiue Can you help with this?
@Tracin I have solved the error. Thanks so much.
Hi @Tracin, I have two questions; could you help me answer them?
1. With int4_awq engines, max_batch_size 8, and one GPU, the throughput is 379 tokens/s. But with int8 weight-only + KV-cache engines, max_batch_size 64, and two GPUs, the throughput is 1286 tokens/s. Is this expected?
2. Does int4_awq support TP? I have seen that the code does not support it. Has anything changed now?
Also, regarding the config.json generated by build.py: when I change max_batch_size from the default 8 to 16 or another number, an error appears, for example:
curl --noproxy '*' -X POST localhost:8310/v2/models/ensemble/generate -d '{"text_input": "def quickSort", "max_tokens": 100, "bad_words": "", "stop_words": "quickSort","end_id": 2}'
{"error":"in ensemble 'ensemble', Encountered error for requestId 1804289384: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: Tensor 'cache_indirection' has invalid shape (16, 1, 2560), expected (-1, 1, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:149)\n1 0x7f3adc8bb6ed /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x176ed) [0x7f3adc8bb6ed]\n2 0x7f3adc9e3676 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x13f676) [0x7f3adc9e3676]\n3 0x7f3adc923217 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7f217) [0x7f3adc923217]\n4 0x7f3adc92651d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8251d) [0x7f3adc92651d]\n5 0x7f3adc929e10 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x85e10) [0x7f3adc929e10]\n6 0x7f3adc910254 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6c254) [0x7f3adc910254]\n7 0x7f3adc91710f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7310f) [0x7f3adc91710f]\n8 0x7f3b3c64f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f3b3c64f253]\n9 0x7f3b3c3dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f3b3c3dfac3]\n10 0x7f3b3c470814 clone + 68"}
Does that mean max_batch_size cannot be changed?
The config.json generated by build.py is:
root@7728b95eabcd:/data# cat llama_7B_trt_engines_int4_AWQ/1-gpu/config.json
{
"builder_config": {
"autopp_config": null,
"gather_context_logits": false,
"gather_generation_logits": false,
"hf_modules_to_trtllm_modules": {
"down_proj": "mlp_4h_to_h",
"gate_proj": "mlp_h_to_4h",
"k_proj": "attn_k",
"o_proj": "attn_dense",
"q_proj": "attn_q",
"up_proj": "mlp_gate",
"v_proj": "attn_v"
},
"hidden_act": "silu",
"hidden_size": 4096,
"int8": false,
"lora_target_modules": null,
"max_batch_size": 8,
"max_beam_width": 1,
"max_input_len": 2048,
"max_num_tokens": null,
"max_output_len": 512,
"max_position_embeddings": 16384,
"max_prompt_embedding_table_size": 0,
"mlp_hidden_size": 11008,
"name": "llama",
"num_heads": 32,
"num_kv_heads": 32,
"num_layers": 32,
"parallel_build": false,
"pipeline_parallel": 1,
"precision": "float16",
"quant_mode": 33,
"tensor_parallel": 1,
"trtllm_modules_to_hf_modules": {
"attn_dense": "o_proj",
"attn_k": "k_proj",
"attn_q": "q_proj",
"attn_v": "v_proj",
"mlp_4h_to_h": "down_proj",
"mlp_gate": "up_proj",
"mlp_h_to_4h": "gate_proj"
},
"use_refit": false,
"vocab_size": 32064
},
"plugin_config": {
"attention_qk_half_accumulation": false,
"bert_attention_plugin": false,
"context_fmha_type": 1,
"enable_xqa": false,
"gemm_plugin": "float16",
"gpt_attention_plugin": "float16",
"identity_plugin": false,
"layernorm_plugin": false,
"layernorm_quantization_plugin": false,
"lookup_plugin": false,
"lora_plugin": false,
"multi_block_mode": false,
"nccl_plugin": false,
"paged_kv_cache": true,
"quantize_per_token_plugin": false,
"quantize_tensor_plugin": false,
"remove_input_padding": true,
"rmsnorm_plugin": false,
"rmsnorm_quantization_plugin": false,
"smooth_quant_gemm_plugin": false,
"tokens_per_block": 128,
"use_context_fmha_for_generation": false,
"use_custom_all_reduce": false,
"use_paged_context_fmha": false,
"weight_only_groupwise_quant_matmul_plugin": "float16",
"weight_only_quant_matmul_plugin": false
}
}
Thanks.
- I am not sure; there is more than one variable changing between the two experiments.
- int4_awq supports tp_size > 1.
- If you want to change max_batch_size, you have to re-build the engine (see the sketch below).
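For example, a rebuild with a larger batch size is roughly the earlier build.py command plus --max_batch_size. This is an abbreviated sketch: the --max_batch_size flag name is an assumption (check build.py --help for your version), and you should reuse the full flag set from the working build command above rather than this shortened one:
python build.py --model_dir /data/META-CodeLlama-7b-hf/ \
    --quant_ckpt_path /data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group \
    --max_batch_size 16 \
    --output_dir /data/llama_7B_trt_engines_int4_AWQ/1-gpu/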
@Tracin Got it. I will just try them.
Thanks.
Assertion failed: mpiSize == tp * pp. Did you run with MPI?
Hi, how did you solve this error? I met the same problem.
When you want to use model parallelism with launch_triton_server.py, you need to set --world_size.
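For example, for an engine built with tensor_parallel=2 and pipeline_parallel=1 (so tp * pp = 2), the launch would look roughly like this. A sketch only: the --world_size and --model_repo flag names are assumptions based on common usage of that script, and the model repo path is a placeholder:
# world_size must equal tensor_parallel * pipeline_parallel of the built engine
python3 scripts/launch_triton_server.py --world_size 2 --model_repo /path/to/triton_model_repo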