v0.9.0 tensorrt_llm_bls model returns error: Model '${tensorrt_llm_model_name}' is not ready.
System Info
TensorRT-LLM: v0.9.0, tensorrtllm_backend: v0.9.0
Who can help?
@kaiyux
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Set the backend configs:
python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:/tensorrtllm_backend/tokenizer/,triton_max_batch_size:8,preprocessing_instance_count:1,tokenizer_type:auto
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:/tensorrtllm_backend/tokenizer/,triton_max_batch_size:8,postprocessing_instance_count:1,tokenizer_type:auto
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,bls_instance_count:8,accumulate_tokens:False
python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:8
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,max_beam_width:1,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/,batching_strategy:inflight_batching,kv_cache_free_gpu_mem_fraction:0.5,max_queue_delay_microseconds:0,exclude_input_in_output:True,max_attention_window_size:12288
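One way to check whether any template variables were left unsubstituted by the commands above (a diagnostic sketch, not part of the original reproduction):
grep -n -F '${' triton_model_repo/*/config.pbtxt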
- Launch the Triton server; all models report READY:
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --grpc_port 9001 --http_port 9000 --metrics_port 9002 --world_size=1 --model_repo=triton_model_repo
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------+
I0521 12:47:40.294420 6079 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:9001
I0521 12:47:40.294862 6079 http_server.cc:4692] Started HTTPService at 0.0.0.0:9000
I0521 12:47:40.338519 6079 http_server.cc:362] Started Metrics Service at 0.0.0.0:9002
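Per-model readiness can also be confirmed through Triton's HTTP readiness endpoint before sending requests (a sketch assuming the http_port 9000 set above; a 200 status code means the model is ready):
curl -s -o /dev/null -w "%{http_code}\n" localhost:9000/v2/models/tensorrt_llm_bls/ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:9000/v2/models/tensorrt_llm/ready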
- Query the server with the Triton generate endpoint:
curl -X POST localhost:9000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
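Because the BLS model is configured with decoupled_mode:True, the streaming variant of the generate endpoint can be exercised as well (a sketch; the payload mirrors the request above and assumes the BLS model's optional "stream" input):
curl -X POST localhost:9000/v2/models/tensorrt_llm_bls/generate_stream -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "stream": true}'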
Expected behavior
The BLS model returns generated text output.
Actual behavior
The generate request returns an error response containing this traceback:
Traceback (most recent call last):
  File "/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/model.py", line 94, in execute
    for res in res_gen:
  File "/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/decode.py", line 194, in decode
    for gen_response in self._generate(preproc_response, request):
  File "/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/triton_decoder.py", line 270, in _generate
    for r in self._exec_triton_request(triton_req):
  File "/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/triton_decoder.py", line 130, in _exec_triton_request
    raise pb_utils.TritonModelException(r.error().message())
c_python_backend_utils.TritonModelException: Model ${tensorrt_llm_model_name} - Error when running inference: Failed for execute the inference request. Model '${tensorrt_llm_model_name}' is not ready.
Additional notes
Do the parameters tensorrt_llm_model_name and tensorrt_llm_draft_model_name have to be set? How are these two parameters used?
please set "tensorrt_llm_model_name" to "tensorrt_llm" you do not need to touch tensorrt_llm_draft_model_name, unless you are interested in speculative decoding
@janpetrov I am interested in speculative decoding. What do I set the draft model name to?
please set "tensorrt_llm_model_name" to "tensorrt_llm" you do not need to touch tensorrt_llm_draft_model_name, unless you are interested in speculative decoding
Yes, the issue is resolved. Thanks.
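For anyone arriving here for the speculative-decoding follow-up: tensorrt_llm_draft_model_name appears to take the Triton model name of a separately deployed draft engine in the same model repository. The draft model name below is purely illustrative and not taken from this issue:
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,bls_instance_count:8,accumulate_tokens:False,tensorrt_llm_model_name:tensorrt_llm,tensorrt_llm_draft_model_name:tensorrt_llm_draft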