
CodeLlama-7B int4-awq build fails with "The value updated is not the same shape as the original."

Open activezhao opened this issue 1 year ago • 15 comments

System Info

CPU x86_64

GPU NVIDIA A10

TensorRT-LLM branch: main, commit id: cad22332550eef9be579e767beb7d605dd96d6f3

CUDA: NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4

Who can help?

Quantization: @Tracin

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

I use the following command to quantize CodeLlama-7B with int4-awq:

python tensorrt_llm/examples/quantization/quantize.py --model_dir /data/META-CodeLlama-7b-hf/ \
                --dtype float16 \
                --qformat int4_awq \
                --export_path /data/trt_llama_7b_quantized_int4-awq \
                --calib_size 32

And I get the files like this:

trt_llama_7b_quantized_int4-awq/
├── llama_tp1.json
└── llama_tp1_rank0.npz

Then I try to build the int4-AWQ engines:

python build.py --model_dir /data/META-CodeLlama-7b-hf/ \
                --quant_ckpt_path /data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --paged_kv_cache \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --vocab_size 32064  \
                --rotary_base 1000000  \
                --use_weight_only \
                --weight_only_precision int4_awq \
                --per_group \
                --output_dir /data/llama_7B_trt_engines_int4_AWQ/1-gpu/

Expected behavior

Get engine files, maybe like this:

-rw-r--r-- 1 root root       2191 Jan 25 07:39 config.json
-rw-r--r-- 1 root root 6876290868 Jan 25 07:39 llama_float16_tp2_rank0.engine
-rw-r--r-- 1 root root 6876290868 Jan 25 07:39 llama_float16_tp2_rank1.engine
-rw-r--r-- 1 root root     163591 Jan 25 07:39 model.cache

actual behavior

I get this error: AssertionError: The value updated is not the same shape as the original. Updated: (32064, 4096), original: (32016, 4096)

The full output:

root@7728b95eabcd:/data/tensorrt_llm/examples/llama# python build.py --model_dir /data/META-CodeLlama-7b-hf/ \
                --quant_ckpt_path /data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --paged_kv_cache \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --vocab_size 32064  \
                --rotary_base 1000000  \
                --use_weight_only \
                --weight_only_precision int4_awq \
                --per_group \
                --output_dir /data/llama_7B_trt_engines_int4_AWQ/1-gpu/
[01/25/2024-01:36:11] [TRT-LLM] [I] Serially build TensorRT engines.
[01/25/2024-01:36:11] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 133, GPU 228 (MiB)
[01/25/2024-01:36:13] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2068, GPU 540 (MiB)
[01/25/2024-01:36:13] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[01/25/2024-01:36:13] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 2.5261 (GiB) Device 0.0000 (GiB)
[01/25/2024-01:36:13] [TRT-LLM] [I] Loading weights from groupwise AWQ LLaMA checkpoint...
Traceback (most recent call last):
  File "/data/tensorrt_llm/examples/llama/build.py", line 1051, in <module>
    build(0, args)
  File "/data/tensorrt_llm/examples/llama/build.py", line 995, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/data/tensorrt_llm/examples/llama/build.py", line 873, in build_rank_engine
    tensorrt_llm_llama = get_model_object(args,
  File "/data/tensorrt_llm/examples/llama/build.py", line 745, in get_model_object
    load_from_awq_llama(tensorrt_llm_llama=tensorrt_llm_llama,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1522, in load_from_awq_llama
    tensorrt_llm_llama.vocab_embedding.weight.value = v.to(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 113, in value
    assert v.shape == self._shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (32064, 4096), original: (32016, 4096)

additional notes

In CodeLlama-7b-hf/config.json, the vocab_size is 32016:

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 16384,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.32.0.dev0",
  "use_cache": true,
  "vocab_size": 32016
}

But in trt_llama_7b_quantized_int4-awq/llama_tp1.json, the vocab_size is 32064:

{"version": 0.4, "quantization": "int4_awq", "awq_block_size": 128, "dtype": "float16", "vocab_size": 32064, "rank": 0, "tensor_parallel": 1, "vocab_embedding": {"weight": "_np:vocab_embedding:weight"}, "positional_embedding": null, "layers": [{"decoder_type": "llama", "input_layernorm": {"weight": "_np:layers:0:input_layernorm:weight", "bias": null, "layernorm_type": "rms", "eps": 1e-05}, "mlp_layernorm": null

The vocab_size values are different, so how can I solve this problem?

Thanks.

activezhao avatar Jan 25 '24 08:01 activezhao

Please try adding the quantize_lm_head option to build.py.

Tracin avatar Jan 26 '24 09:01 Tracin

Please try adding the quantize_lm_head option to build.py.

@Tracin OK, thanks for your reply, I will just try it.

activezhao avatar Jan 26 '24 15:01 activezhao

Please try adding the quantize_lm_head option to build.py.

@Tracin I added the --quantize_lm_head parameter, but a new error appeared.

python build.py --model_dir /data/META-CodeLlama-7b-hf/ \
                --quant_ckpt_path /data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --quantize_lm_head \
                --use_weight_only \
                --weight_only_precision int4_awq \
                --per_group \
                --output_dir /data/llama_7B_trt_engines_int4_AWQ/1-gpu/
[01/29/2024-01:42:25] [TRT-LLM] [I] To use awq we pad vocab_size to 32064.
[01/29/2024-01:42:25] [TRT-LLM] [I] Serially build TensorRT engines.
[01/29/2024-01:42:25] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 133, GPU 228 (MiB)
[01/29/2024-01:42:27] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2068, GPU 540 (MiB)
[01/29/2024-01:42:27] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[01/29/2024-01:42:27] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 2.5260 (GiB) Device 0.0000 (GiB)
[01/29/2024-01:42:27] [TRT-LLM] [I] Loading weights from groupwise AWQ LLaMA checkpoint...
Traceback (most recent call last):
  File "/data/tensorrt_llm/examples/llama/build.py", line 1051, in <module>
    build(0, args)
  File "/data/tensorrt_llm/examples/llama/build.py", line 995, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/data/tensorrt_llm/examples/llama/build.py", line 873, in build_rank_engine
    tensorrt_llm_llama = get_model_object(args,
  File "/data/tensorrt_llm/examples/llama/build.py", line 745, in get_model_object
    load_from_awq_llama(tensorrt_llm_llama=tensorrt_llm_llama,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1527, in load_from_awq_llama
    v = [load(awq_key_list[1] + suf) for suf in awq_suffix_list]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1527, in <listcomp>
    v = [load(awq_key_list[1] + suf) for suf in awq_suffix_list]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/weight.py", line 1403, in load
    v = torch.from_numpy(awq_llama[awq_prefix + key])
  File "/usr/local/lib/python3.10/dist-packages/numpy/lib/npyio.py", line 263, in __getitem__
    raise KeyError(f"{key} is not a file in the archive")
KeyError: '_np:lm_head:weights_scaling_factor is not a file in the archive'

And I printed the values of awq_key_list and awq_suffix_list:

awq_key_list is: ['vocab_embedding:weight', 'lm_head', 'final_layernorm:weight', 'attention:qkv:', '', 'attention:dense', 'mlp:gate', 'mlp:proj', 'mlp:fc', 'input_layernorm:weight', 'post_layernorm:weight']

awq_suffix_list is: [':weight', ':weights_scaling_factor', ':prequant_scaling_factor']

activezhao avatar Jan 29 '24 01:01 activezhao

Sorry for the ambiguous instruction; you have to add --quantize_lm_head to quantize.py as well. Since there is a bug in AMMO, we have to enable this manually until the next release.
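
For reference, a sketch of the original quantize.py command with that flag added; the flag name is taken from the comment above and has not been checked against the script's argument parser:

python tensorrt_llm/examples/quantization/quantize.py --model_dir /data/META-CodeLlama-7b-hf/ \
                --dtype float16 \
                --qformat int4_awq \
                --quantize_lm_head \
                --export_path /data/trt_llama_7b_quantized_int4-awq \
                --calib_size 32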

Tracin avatar Jan 29 '24 02:01 Tracin

Sorry for the ambiguous instruction; you have to add --quantize_lm_head to quantize.py as well. Since there is a bug in AMMO, we have to enable this manually until the next release.

@Tracin OK, got it, I will try it. Thanks.

activezhao avatar Jan 29 '24 04:01 activezhao

@Tracin It works, so nice.

But, when I launch Triton Server, the error is:

E0129 06:39:43.236314 25545 model_repository_manager.cc:580] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:94)
1       0x7fec6c8bb6ed /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x176ed) [0x7fec6c8bb6ed]
2       0x7fec6c8d4814 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x30814) [0x7fec6c8d4814]
3       0x7fec6c919cc8 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x75cc8) [0x7fec6c919cc8]
4       0x7fec6c90f0a8 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6b0a8) [0x7fec6c90f0a8]
5       0x7fec6c8f03ee /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4c3ee) [0x7fec6c8f03ee]
6       0x7fec6c8f14e2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4d4e2) [0x7fec6c8f14e2]
7       0x7fec6c8e1045 TRITONBACKEND_ModelInstanceInitialize + 101
8       0x7feccef89226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7feccef89226]
9       0x7feccef8a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7feccef8a466]
10      0x7feccef6d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7feccef6d165]
11      0x7feccef6d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7feccef6d7a6]
12      0x7feccef79a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7feccef79a1d]
13      0x7fecce5e4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fecce5e4ee8]
14      0x7feccef63feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7feccef63feb]
15      0x7feccef73dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7feccef73dc5]
16      0x7feccef78d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7feccef78d36]
17      0x7feccf069330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7feccf069330]
18      0x7feccf06ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7feccf06ca23]
19      0x7feccf1c0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7feccf1c0d82]
20      0x7fecce84f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fecce84f253]
21      0x7fecce5dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fecce5dfac3]
22      0x7fecce670814 clone + 68;
I0129 06:39:43.236334 25545 model_lifecycle.cc:286] VersionStates() 'tensorrt_llm'
I0129 06:39:43.236375 25545 server.cc:606]

How to solve it?

activezhao avatar Jan 29 '24 14:01 activezhao

Assertion failed: mpiSize == tp * pp. Did you run with MPI?

Tracin avatar Jan 30 '24 02:01 Tracin

Assertion failed: mpiSize == tp * pp. Did you run with MPI?

@Tracin Yes, I use scripts/launch_triton_server.py to launch Triton Server.

activezhao avatar Jan 30 '24 08:01 activezhao

Assertion failed: mpiSize == tp * pp. Did you run with MPI?

@Tracin Yes, I use scripts/launch_triton_server.py to launch Triton Server.

@byshiue Can you help with this?

Tracin avatar Jan 30 '24 08:01 Tracin

Assertion failed: mpiSize == tp * pp. Did you run with MPI?

@Tracin Yes, I use scripts/launch_triton_server.py to launch Triton Server.

@byshiue Can you help with this?

@Tracin I have solved the error. Thanks so much.

activezhao avatar Jan 30 '24 09:01 activezhao

Hi @Tracin, I have two questions; could you help me answer them?

1. With int4_awq engines (max_batch_size 8, one GPU), the throughput is 379 tokens/s. But with int8_weight + kv cache engines (max_batch_size 64, two GPUs), the throughput is 1286 tokens/s. Is this expected?

2. Does int4_awq support TP? I have seen that the code does not support it. Has anything changed now?

Plus, for the config.json generated by build.py, when I change the max_batch_size from the default 8 to 16 or another number, an error appears, such as:

curl --noproxy '*'  -X POST localhost:8310/v2/models/ensemble/generate -d '{"text_input": "def quickSort", "max_tokens": 100, "bad_words": "", "stop_words": "quickSort","end_id": 2}'
{"error":"in ensemble 'ensemble', Encountered error for requestId 1804289384: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: Tensor 'cache_indirection' has invalid shape (16, 1, 2560), expected (-1, 1, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:149)\n1       0x7f3adc8bb6ed /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x176ed) [0x7f3adc8bb6ed]\n2       0x7f3adc9e3676 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x13f676) [0x7f3adc9e3676]\n3       0x7f3adc923217 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7f217) [0x7f3adc923217]\n4       0x7f3adc92651d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8251d) [0x7f3adc92651d]\n5       0x7f3adc929e10 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x85e10) [0x7f3adc929e10]\n6       0x7f3adc910254 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6c254) [0x7f3adc910254]\n7       0x7f3adc91710f /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7310f) [0x7f3adc91710f]\n8       0x7f3b3c64f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f3b3c64f253]\n9       0x7f3b3c3dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f3b3c3dfac3]\n10      0x7f3b3c470814 clone + 68"}

Does this mean the max_batch_size cannot be changed?

The config.json generated by build.py is:

root@7728b95eabcd:/data# cat llama_7B_trt_engines_int4_AWQ/1-gpu/config.json 
{
  "builder_config": {
    "autopp_config": null,
    "gather_context_logits": false,
    "gather_generation_logits": false,
    "hf_modules_to_trtllm_modules": {
      "down_proj": "mlp_4h_to_h",
      "gate_proj": "mlp_h_to_4h",
      "k_proj": "attn_k",
      "o_proj": "attn_dense",
      "q_proj": "attn_q",
      "up_proj": "mlp_gate",
      "v_proj": "attn_v"
    },
    "hidden_act": "silu",
    "hidden_size": 4096,
    "int8": false,
    "lora_target_modules": null,
    "max_batch_size": 8,
    "max_beam_width": 1,
    "max_input_len": 2048,
    "max_num_tokens": null,
    "max_output_len": 512,
    "max_position_embeddings": 16384,
    "max_prompt_embedding_table_size": 0,
    "mlp_hidden_size": 11008,
    "name": "llama",
    "num_heads": 32,
    "num_kv_heads": 32,
    "num_layers": 32,
    "parallel_build": false,
    "pipeline_parallel": 1,
    "precision": "float16",
    "quant_mode": 33,
    "tensor_parallel": 1,
    "trtllm_modules_to_hf_modules": {
      "attn_dense": "o_proj",
      "attn_k": "k_proj",
      "attn_q": "q_proj",
      "attn_v": "v_proj",
      "mlp_4h_to_h": "down_proj",
      "mlp_gate": "up_proj",
      "mlp_h_to_4h": "gate_proj"
    },
    "use_refit": false,
    "vocab_size": 32064
  },
  "plugin_config": {
    "attention_qk_half_accumulation": false,
    "bert_attention_plugin": false,
    "context_fmha_type": 1,
    "enable_xqa": false,
    "gemm_plugin": "float16",
    "gpt_attention_plugin": "float16",
    "identity_plugin": false,
    "layernorm_plugin": false,
    "layernorm_quantization_plugin": false,
    "lookup_plugin": false,
    "lora_plugin": false,
    "multi_block_mode": false,
    "nccl_plugin": false,
    "paged_kv_cache": true,
    "quantize_per_token_plugin": false,
    "quantize_tensor_plugin": false,
    "remove_input_padding": true,
    "rmsnorm_plugin": false,
    "rmsnorm_quantization_plugin": false,
    "smooth_quant_gemm_plugin": false,
    "tokens_per_block": 128,
    "use_context_fmha_for_generation": false,
    "use_custom_all_reduce": false,
    "use_paged_context_fmha": false,
    "weight_only_groupwise_quant_matmul_plugin": "float16",
    "weight_only_quant_matmul_plugin": false
  }
}

Thanks.

activezhao avatar Jan 30 '24 11:01 activezhao

  1. I am not sure; more than one variable differs between the two experiments.
  2. int4_awq supports tp_size > 1.
  3. If you want to change max_batch_size, you have to re-build the engine (see the sketch below).
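
For point 3, a minimal sketch that reuses the build command from earlier in the thread with a larger batch size; it assumes build.py accepts --max_batch_size (the generated config above suggests the default is 8):

python build.py --model_dir /data/META-CodeLlama-7b-hf/ \
                --quant_ckpt_path /data/trt_llama_7b_quantized_int4-awq/llama_tp1_rank0.npz \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --quantize_lm_head \
                --use_weight_only \
                --weight_only_precision int4_awq \
                --per_group \
                --max_batch_size 16 \
                --output_dir /data/llama_7B_trt_engines_int4_AWQ/1-gpu/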

Tracin avatar Feb 01 '24 07:02 Tracin

  1. I am not sure; more than one variable differs between the two experiments.
  2. int4_awq supports tp_size > 1.
  3. If you want to change max_batch_size, you have to re-build the engine.

@Tracin Got it. I will just try them.

Thanks.

activezhao avatar Feb 01 '24 14:02 activezhao

Assertion failed: mpiSize == tp * pp. Did you run with MPI?

Hi, how did you solve this error? I met the same problem.

moonlightian avatar Mar 21 '24 06:03 moonlightian

When you want to use model parallelism with launch_triton_server.py, you need to set --world_size.
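
For example, for a single-GPU engine (tensor_parallel = 1, pipeline_parallel = 1) the world size must be 1. A rough sketch is below; the --model_repo path is a placeholder and the flag names should be double-checked against your version of the script:

python3 scripts/launch_triton_server.py --world_size 1 --model_repo /path/to/triton_model_repo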

byshiue avatar Mar 27 '24 07:03 byshiue