Assertion failed: Invalid tensor name: decoder_input_lengths
### System Info
- docker image: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
- tensorrt_llm: 0.9.0
### Who can help?
@kaiyux @byshiue
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
### Reproduction
```bash
docker run --rm -it --gpus all --net host --shm-size=64g \
    --ulimit stack=67108864 \
    -v /share/datasets/tmp_share/chenyonghua/models/tensorrt_engines_v0.9.0/llama/Llama-2-7b-chat_TP1/tensorrtllm_backend:/tensorrtllm_backend \
    -v /share/datasets/public_models/Llama-2-7b-chat-hf:/share/datasets/public_models/Llama-2-7b-chat-hf \
    nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 bash
```

Then, inside the container:

```bash
cd /tensorrtllm_backend
export CUDA_VISIBLE_DEVICES=0
python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo
```
```bash
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
    "text_input": "How do I count to nine in French?",
    "parameters": {
        "max_tokens": 100,
        "bad_words": [""],
        "stop_words": [""]
    }
}'
```
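For reference, the same request can also be sent from Python. This is a minimal sketch using the `requests` library (not part of the original reproduction), with the endpoint and payload taken verbatim from the curl command above:

```python
import requests

# Same endpoint and payload as the curl command above; assumes Triton's HTTP
# service is listening on localhost:8000 (see the server log below).
payload = {
    "text_input": "How do I count to nine in French?",
    "parameters": {
        "max_tokens": 100,
        "bad_words": [""],
        "stop_words": [""],
    },
}

resp = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
print(resp.status_code)
print(resp.text)  # with this setup, the body contains the assertion error shown below
```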
### Expected behavior

I would expect the TensorRT engine to work with Triton Inference Server and to return a correct response.
### Actual behavior

Client response:

```
{"error":"in ensemble 'ensemble', [TensorRT-LLM][ERROR] Assertion failed: Invalid tensor name: decoder_input_lengths (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/../tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/inferenceRequest.h:269)\n1 0x7fdc94c6eba4 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits
```
Server log:

```
I0620 09:19:22.915296 878 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA H800"
I0620 09:19:22.986074 878 metrics.cc:770] "Collecting CPU metrics"
I0620 09:19:22.986387 878 tritonserver.cc:2557]
+----------------------------------+----------------------------------------+
| Option                           | Value                                  |
+----------------------------------+----------------------------------------+
| server_id                        | triton                                 |
| server_version                   | 2.46.0                                 |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /tensorrtllm_backend/triton_model_repo |
| model_control_mode               | MODE_NONE                              |
| strict_model_config              | 1                                      |
| model_config_name                |                                        |
| rate_limit                       | OFF                                    |
| pinned_memory_pool_byte_size     | 268435456                              |
| cuda_memory_pool_byte_size{0}    | 67108864                               |
| min_supported_compute_capability | 6.0                                    |
| strict_readiness                 | 1                                      |
| exit_timeout                     | 30                                     |
| cache_enabled                    | 0                                      |
+----------------------------------+----------------------------------------+
I0620 09:19:22.989899 878 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
I0620 09:19:22.990119 878 http_server.cc:4692] "Started HTTPService at 0.0.0.0:8000"
I0620 09:19:23.031496 878 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
[[ 1 1128 437 306 2302 304 14183 297 5176 29973]]
[[10]]
[TensorRT-LLM][ERROR] [TensorRT-LLM][ERROR] Assertion failed: Invalid tensor name: decoder_input_lengths (/tmp/tritonbuild/tensorrtllm/inflight_batcher_llm/../tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/inferenceRequest.h:269)
1 0x7fdc94c6eba4 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits
```
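The rejected tensor `decoder_input_lengths` is not part of the client request, so it presumably originates from the model repository configuration. One way to check which tensors each model declares is Triton's KServe v2 model metadata endpoint. The sketch below is only a debugging aid under assumptions: it assumes the default `triton_model_repo` layout with `preprocessing`, `tensorrt_llm`, and `postprocessing` models alongside `ensemble` (only `ensemble` appears in the reproduction above), and that the HTTP service is still reachable on port 8000:

```python
import requests

# Triton's KServe v2 metadata endpoint reports the inputs/outputs that each
# model's config.pbtxt declares. The model names below are assumptions based
# on the default triton_model_repo layout; adjust them to the actual repository.
MODELS = ["ensemble", "preprocessing", "tensorrt_llm", "postprocessing"]

for name in MODELS:
    resp = requests.get(f"http://localhost:8000/v2/models/{name}")
    resp.raise_for_status()
    inputs = [t["name"] for t in resp.json().get("inputs", [])]
    print(f"{name}: {inputs}")
    # If 'decoder_input_lengths' is declared here but rejected by the
    # TensorRT-LLM batch manager at runtime, the model repository templates
    # and the TensorRT-LLM version inside the container may be out of sync.
```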
### Additional notes
- model: llama2-7b-chat
Did you solve it?