TensorRT-LLM
satisfyProfile Runtime dimension does not satisfy any optimization profile
I built a TP4 LLaMA 70B engine and tried profiling it with nsys; this is the error message:
+ nsys profile -o test -t cuda,nvtx --force-overwrite true mpirun -n 4 ./cpp/build/benchmarks/gptSessionBenchmark --duration 1 --warm_up 1 --num_runs 3 --model llama --engine_dir /code/tensorrt_llm/engines/new_s8_weight_only --batch_size 32 --input_output_len 512,3
NCCL version 2.18.3+cuda12.2
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
(the line above is printed once per MPI rank, four times in total)
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: Tensor 'input_ids' has invalid shape (16384), expected (-1) (/code/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:149)
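For what it's worth, the failing shape in the log is exactly the flattened token count: with `remove_input_padding` enabled, `input_ids` collapses into a single `num_tokens` axis, so the runtime shape becomes `batch_size * input_len` rather than `(batch_size, input_len)`. A minimal sketch of the arithmetic (assuming the `num_tokens` profile bound is fixed at build time, e.g. via `max_num_tokens`):

```python
# With remove_input_padding, input_ids is packed into one num_tokens axis,
# so the benchmark shape below produces a 1-D tensor of this many tokens.
batch_size = 32
input_len = 512
num_tokens = batch_size * input_len
print(num_tokens)  # 16384, matching the "invalid shape (16384)" in the log
```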
Here is the profiling command:
set -ex
export CUDA_VISIBLE_DEVICES="4,5,6,7"
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export NCCL_DEBUG=WARN
tp=4
pp=1
ws=4
engine_dir=/code/tensorrt_llm/engines/new_s8_weight_only
nsys profile -o test -t cuda,nvtx --force-overwrite true \
mpirun -n 4 ./cpp/build/benchmarks/gptSessionBenchmark \
--duration 1 \
--warm_up 1 \
--num_runs 3 \
--model llama \
--engine_dir "${engine_dir}" \
--batch_size 32 \
--input_output_len "512,3"
And the engine config is:
{
"builder_config": {
"autopp_config": null,
"gather_context_logits": false,
"gather_generation_logits": false,
"hidden_act": "silu",
"hidden_size": 8192,
"int8": true,
"lora_target_modules": null,
"max_batch_size": 64,
"max_beam_width": 1,
"max_input_len": 4096,
"max_num_tokens": null,
"max_output_len": 4096,
"max_position_embeddings": 4096,
"max_prompt_embedding_table_size": 0,
"mlp_hidden_size": 28672,
"name": "llama",
"num_heads": 64,
"num_kv_heads": 8,
"num_layers": 80,
"parallel_build": false,
"pipeline_parallel": 1,
"precision": "float16",
"quant_mode": 2,
"tensor_parallel": 4,
"use_refit": false,
"vocab_size": 32000
},
"plugin_config": {
"attention_qk_half_accumulation": false,
"bert_attention_plugin": false,
"context_fmha_type": 1,
"gemm_plugin": "float16",
"gpt_attention_plugin": "float16",
"identity_plugin": false,
"layernorm_plugin": false,
"layernorm_quantization_plugin": false,
"lookup_plugin": false,
"lora_plugin": false,
"multi_block_mode": false,
"nccl_plugin": "float16",
"paged_kv_cache": false,
"quantize_per_token_plugin": false,
"quantize_tensor_plugin": false,
"remove_input_padding": true,
"rmsnorm_plugin": false,
"rmsnorm_quantization_plugin": false,
"smooth_quant_gemm_plugin": false,
"tokens_per_block": 0,
"use_context_fmha_for_generation": false,
"use_custom_all_reduce": false,
"use_paged_context_fmha": false,
"weight_only_groupwise_quant_matmul_plugin": false,
"weight_only_quant_matmul_plugin": "float16"
}
}
The problem: the engine is built with max_batch_size=64, max_input_len=4096, max_output_len=4096.
Why does the test shape (batch_size=32, input_len=512, output_len=3) not satisfy any optimization profile?
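The limits a shape is checked against are the ones recorded in the engine's `config.json` (plus, with `remove_input_padding`, the flattened token count). A hypothetical helper (not part of TensorRT-LLM; key names follow the config shown above) to compare a requested benchmark shape against those limits:

```python
import json

def check_request(config_path, batch_size, input_len, output_len):
    """Compare a requested shape against the limits in an engine's config.json.
    Hypothetical helper for illustration, not a TensorRT-LLM API."""
    with open(config_path) as f:
        cfg = json.load(f)["builder_config"]
    problems = []
    if batch_size > cfg["max_batch_size"]:
        problems.append(f"batch_size {batch_size} > max_batch_size {cfg['max_batch_size']}")
    if input_len > cfg["max_input_len"]:
        problems.append(f"input_len {input_len} > max_input_len {cfg['max_input_len']}")
    if output_len > cfg["max_output_len"]:
        problems.append(f"output_len {output_len} > max_output_len {cfg['max_output_len']}")
    # With remove_input_padding, the packed token count must also fit the
    # num_tokens profile bound (max_num_tokens, when it is set).
    max_num_tokens = cfg.get("max_num_tokens")
    if max_num_tokens is not None and batch_size * input_len > max_num_tokens:
        problems.append(f"{batch_size * input_len} tokens > max_num_tokens {max_num_tokens}")
    return problems
```

With the config posted above (max_num_tokens is null), bs=32 / in=512 / out=3 passes every per-dimension check, which is why the profile rejection is surprising and points at the packed-token bound or a stale engine instead.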
This is the engine build command:
set -ex
export CUDA_VISIBLE_DEVICES="4,5,6,7"
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
export NCCL_DEBUG=WARN
tp=4
pp=1
ws=4
model_dir=/share/models/official/Llama-2-70b-chat-hf
engine_dir=/code/tensorrt_llm/engines/new_s8_weight_only
python3 build.py --model_dir=${model_dir} \
--remove_input_padding \
--world_size ${ws} \
--tp_size ${tp} \
--pp_size ${pp} \
--dtype float16 \
--enable_context_fmha \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--max_batch_size 64 \
--max_input_len 2048 \
--max_output_len 512 \
--output_dir ${engine_dir} \
--per_channel \
--use_weight_only
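One detail worth double-checking: this build command passes `--max_input_len 2048` and `--max_output_len 512`, yet the `config.json` posted above reports 4096 for both, which is what you would see if `engine_dir` still held an engine from a previous build. A hypothetical sketch (`flags_match_config` is not a TensorRT-LLM function) to catch such stale-engine mismatches:

```python
import json

def flags_match_config(config_path, expected):
    # Hypothetical sanity check: compare the limits recorded in the engine's
    # config.json against the flags you believe you built with. Any mismatch
    # usually means engine_dir still contains an older engine.
    with open(config_path) as f:
        built = json.load(f)["builder_config"]
    return {k: (v, built.get(k)) for k, v in expected.items() if built.get(k) != v}

# Usage sketch against the values in this thread:
# flags_match_config(engine_dir + "/config.json",
#                    {"max_batch_size": 64, "max_input_len": 2048,
#                     "max_output_len": 512})
# would report max_input_len (2048 vs 4096) and max_output_len (512 vs 4096).
```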
>>> import tensorrt_llm
>>> tensorrt_llm.__version__
'0.7.1'
Could you try the latest main branch?
Same problem with TensorRT-LLM version: 0.10.0.dev2024043000
[TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
Thank you for the report. Although you say you encounter the same problem, the error looks different. Could you share your full reproduction steps and the full log?
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity.