v0.9.0 tensorrt_llm_bls model returns error: Model '${tensorrt_llm_model_name}' is not ready.
System Info
TensorRT-LLM: v0.9.0, tensorrtllm_backend: v0.9.0
Who can help?
@kaiyux
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Set the backend configs:
python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:/tensorrtllm_backend/tokenizer/,triton_max_batch_size:8,preprocessing_instance_count:1,tokenizer_type:auto
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:/tensorrtllm_backend/tokenizer/,triton_max_batch_size:8,postprocessing_instance_count:1,tokenizer_type:auto
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,bls_instance_count:8,accumulate_tokens:False
python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:8
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,max_beam_width:1,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/,batching_strategy:inflight_batching,kv_cache_free_gpu_mem_fraction:0.5,max_queue_delay_microseconds:0,exclude_input_in_output:True,max_attention_window_size:12288
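One way to check whether any template variables were left unsubstituted by the commands above (a diagnostic sketch, not part of the original reproduction):
grep -n -F '${' triton_model_repo/*/config.pbtxt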
- Launch the Triton server; all models report READY:
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --grpc_port 9001 --http_port 9000 --metrics_port 9002 --world_size=1 --model_repo=triton_model_repo
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------+
I0521 12:47:40.294420 6079 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:9001
I0521 12:47:40.294862 6079 http_server.cc:4692] Started HTTPService at 0.0.0.0:9000
I0521 12:47:40.338519 6079 http_server.cc:362] Started Metrics Service at 0.0.0.0:9002
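Per-model readiness can also be confirmed through Triton's HTTP readiness endpoint before sending requests (a sketch assuming the http_port 9000 set above; a 200 status code means the model is ready):
curl -s -o /dev/null -w "%{http_code}\n" localhost:9000/v2/models/tensorrt_llm_bls/ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:9000/v2/models/tensorrt_llm/ready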
- Query the server with the Triton generate endpoint:
curl -X POST localhost:9000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
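Because the BLS model is configured with decoupled_mode:True, the streaming variant of the generate endpoint can be exercised as well (a sketch; the payload mirrors the request above and assumes the BLS model's optional "stream" input):
curl -X POST localhost:9000/v2/models/tensorrt_llm_bls/generate_stream -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "stream": true}'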
Expected behavior
The BLS model returns generated text output.
Actual behavior
The generate request returns an error response containing this traceback:
Traceback (most recent call last):
  File "/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/model.py", line 94, in execute
    for res in res_gen:
  File "/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/decode.py", line 194, in decode
    for gen_response in self._generate(preproc_response, request):
  File "/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/triton_decoder.py", line 270, in _generate
    for r in self._exec_triton_request(triton_req):
  File "/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/triton_decoder.py", line 130, in _exec_triton_request
    raise pb_utils.TritonModelException(r.error().message())
c_python_backend_utils.TritonModelException: Model ${tensorrt_llm_model_name} - Error when running inference: Failed for execute the inference request. Model '${tensorrt_llm_model_name}' is not ready.
Additional notes
Do the parameters tensorrt_llm_model_name and tensorrt_llm_draft_model_name have to be set? How are these two parameters used?
please set "tensorrt_llm_model_name" to "tensorrt_llm" you do not need to touch tensorrt_llm_draft_model_name, unless you are interested in speculative decoding
@janpetrov I am interested in speculative decoding. What do I set the draft model name to?
please set "tensorrt_llm_model_name" to "tensorrt_llm" you do not need to touch tensorrt_llm_draft_model_name, unless you are interested in speculative decoding
Yes, the issue is resolved. Thanks.
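For anyone arriving here for the speculative-decoding follow-up: tensorrt_llm_draft_model_name appears to take the Triton model name of a separately deployed draft engine in the same model repository. The draft model name below is purely illustrative and not taken from this issue:
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,bls_instance_count:8,accumulate_tokens:False,tensorrt_llm_model_name:tensorrt_llm,tensorrt_llm_draft_model_name:tensorrt_llm_draft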