TensorRT-LLM
satisfyProfile Runtime dimension does not satisfy any optimization profile
I built a TP4 LLaMA 70B engine and tried profiling it with nsys; this is the error message:
+ nsys profile -o test -t cuda,nvtx --force-overwrite true mpirun -n 4 ./cpp/build/benchmarks/gptSessionBenchmark --duration 1 --warm_up 1 --num_runs 3 --model llama --engine_dir /code/tensorrt_llm/engines/new_s8_weight_only --batch_size 32 --input_output_len 512,3
NCCL version 2.18.3+cuda12.2
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
(the line above is printed once per MPI rank, four times in total)
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: Tensor 'input_ids' has invalid shape (16384), expected (-1) (/code/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:149)
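For what it's worth, the failing shape in the log is exactly the flattened token count: with `remove_input_padding` enabled, `input_ids` collapses into a single `num_tokens` axis, so the runtime shape becomes `batch_size * input_len` rather than `(batch_size, input_len)`. A minimal sketch of the arithmetic (assuming the `num_tokens` profile bound is fixed at build time, e.g. via `max_num_tokens`):

```python
# With remove_input_padding, input_ids is packed into one num_tokens axis,
# so the benchmark shape below produces a 1-D tensor of this many tokens.
batch_size = 32
input_len = 512
num_tokens = batch_size * input_len
print(num_tokens)  # 16384, matching the "invalid shape (16384)" in the log
```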
Here is the profiling command:
set -ex
export CUDA_VISIBLE_DEVICES="4,5,6,7"
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export NCCL_DEBUG=WARN
tp=4
pp=1
ws=4
engine_dir=/code/tensorrt_llm/engines/new_s8_weight_only
nsys profile -o test -t cuda,nvtx --force-overwrite true \
mpirun -n 4 ./cpp/build/benchmarks/gptSessionBenchmark \
--duration 1 \
--warm_up 1 \
--num_runs 3 \
--model llama \
--engine_dir "${engine_dir}" \
--batch_size 32 \
--input_output_len "512,3"
And the engine config is:
{
"builder_config": {
"autopp_config": null,
"gather_context_logits": false,
"gather_generation_logits": false,
"hidden_act": "silu",
"hidden_size": 8192,
"int8": true,
"lora_target_modules": null,
"max_batch_size": 64,
"max_beam_width": 1,
"max_input_len": 4096,
"max_num_tokens": null,
"max_output_len": 4096,
"max_position_embeddings": 4096,
"max_prompt_embedding_table_size": 0,
"mlp_hidden_size": 28672,
"name": "llama",
"num_heads": 64,
"num_kv_heads": 8,
"num_layers": 80,
"parallel_build": false,
"pipeline_parallel": 1,
"precision": "float16",
"quant_mode": 2,
"tensor_parallel": 4,
"use_refit": false,
"vocab_size": 32000
},
"plugin_config": {
"attention_qk_half_accumulation": false,
"bert_attention_plugin": false,
"context_fmha_type": 1,
"gemm_plugin": "float16",
"gpt_attention_plugin": "float16",
"identity_plugin": false,
"layernorm_plugin": false,
"layernorm_quantization_plugin": false,
"lookup_plugin": false,
"lora_plugin": false,
"multi_block_mode": false,
"nccl_plugin": "float16",
"paged_kv_cache": false,
"quantize_per_token_plugin": false,
"quantize_tensor_plugin": false,
"remove_input_padding": true,
"rmsnorm_plugin": false,
"rmsnorm_quantization_plugin": false,
"smooth_quant_gemm_plugin": false,
"tokens_per_block": 0,
"use_context_fmha_for_generation": false,
"use_custom_all_reduce": false,
"use_paged_context_fmha": false,
"weight_only_groupwise_quant_matmul_plugin": false,
"weight_only_quant_matmul_plugin": "float16"
}
}
The problem: the engine is built with max_batch_size=64, max_input_len=4096, max_output_len=4096.
Why does the test shape (batch_size=32, input_len=512, output_len=3) not satisfy any optimization profile?
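The limits a shape is checked against are the ones recorded in the engine's `config.json` (plus, with `remove_input_padding`, the flattened token count). A hypothetical helper (not part of TensorRT-LLM; key names follow the config shown above) to compare a requested benchmark shape against those limits:

```python
import json

def check_request(config_path, batch_size, input_len, output_len):
    """Compare a requested shape against the limits in an engine's config.json.
    Hypothetical helper for illustration, not a TensorRT-LLM API."""
    with open(config_path) as f:
        cfg = json.load(f)["builder_config"]
    problems = []
    if batch_size > cfg["max_batch_size"]:
        problems.append(f"batch_size {batch_size} > max_batch_size {cfg['max_batch_size']}")
    if input_len > cfg["max_input_len"]:
        problems.append(f"input_len {input_len} > max_input_len {cfg['max_input_len']}")
    if output_len > cfg["max_output_len"]:
        problems.append(f"output_len {output_len} > max_output_len {cfg['max_output_len']}")
    # With remove_input_padding, the packed token count must also fit the
    # num_tokens profile bound (max_num_tokens, when it is set).
    max_num_tokens = cfg.get("max_num_tokens")
    if max_num_tokens is not None and batch_size * input_len > max_num_tokens:
        problems.append(f"{batch_size * input_len} tokens > max_num_tokens {max_num_tokens}")
    return problems
```

With the config posted above (max_num_tokens is null), bs=32 / in=512 / out=3 passes every per-dimension check, which is why the profile rejection is surprising and points at the packed-token bound or a stale engine instead.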
This is the engine build command:
set -ex
export CUDA_VISIBLE_DEVICES="4,5,6,7"
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export NCCL_DEBUG=WARN
tp=4
pp=1
ws=4
model_dir=/share/models/official/Llama-2-70b-chat-hf
engine_dir=/code/tensorrt_llm/engines/new_s8_weight_only
python3 build.py --model_dir=${model_dir} \
--remove_input_padding \
--world_size ${ws} \
--tp_size ${tp} \
--pp_size ${pp} \
--dtype float16 \
--enable_context_fmha \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--max_batch_size 64 \
--max_input_len 2048 \
--max_output_len 512 \
--output_dir ${engine_dir} \
--per_channel \
--use_weight_only
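One detail worth double-checking: this build command passes `--max_input_len 2048` and `--max_output_len 512`, yet the `config.json` posted above reports 4096 for both, which is what you would see if `engine_dir` still held an engine from a previous build. A hypothetical sketch (`flags_match_config` is not a TensorRT-LLM function) to catch such stale-engine mismatches:

```python
import json

def flags_match_config(config_path, expected):
    # Hypothetical sanity check: compare the limits recorded in the engine's
    # config.json against the flags you believe you built with. Any mismatch
    # usually means engine_dir still contains an older engine.
    with open(config_path) as f:
        built = json.load(f)["builder_config"]
    return {k: (v, built.get(k)) for k, v in expected.items() if built.get(k) != v}

# Usage sketch against the values in this thread:
# flags_match_config(engine_dir + "/config.json",
#                    {"max_batch_size": 64, "max_input_len": 2048,
#                     "max_output_len": 512})
# would report max_input_len (2048 vs 4096) and max_output_len (512 vs 4096).
```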
>>> import tensorrt_llm
>>> tensorrt_llm.__version__
'0.7.1'
Could you try the latest main branch?
Same problem with TensorRT-LLM version: 0.10.0.dev2024043000
[TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
Thank you for the report. Although you say you encounter the same problem, the error looks different. Could you share your full reproduction steps and the full log?
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity.