tensorrtllm_backend
Decoupled BLS model: flag TRITONSERVER_RESPONSE_COMPLETE_FINAL doesn't seem to be sent with the last response delivered to the gRPC client for an async stream request
System Info
- CPU Architecture: x86_64
- GPU: 4x L40S
- CUDA version: 12.4
- TensorRT-LLM: 0.11.0.dev2024052800
- Triton Server version: 2.45.0
- Triton release: 24.04
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
1. Call the BLS model in decoupled mode with an async streaming inference request from the gRPC client.
2. Inspect the value of the `triton_final_response` parameter in the last response received by the gRPC client.
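For reference, a minimal client-side reproduction sketch, assuming a server at `localhost:8001` and the default `tensorrt_llm_bls` input tensors `text_input`/`max_tokens` (the exact input names, shapes, and required tensors depend on your model config):

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def callback(result_queue, result, error):
    # The stream handler pushes every streamed response (or error) here.
    result_queue.put((result, error))


result_queue = queue.Queue()
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Input names/shapes assumed from a typical BLS config; adjust to yours.
text_input = grpcclient.InferInput("text_input", [1, 1], "BYTES")
text_input.set_data_from_numpy(np.array([["Hello"]], dtype=np.object_))
max_tokens = grpcclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(np.array([[16]], dtype=np.int32))

client.start_stream(callback=partial(callback, result_queue))
client.async_stream_infer(model_name="tensorrt_llm_bls",
                          inputs=[text_input, max_tokens])

# Drain the stream and print the triton_final_response parameter of each
# response. With the behavior reported here, bool_param stays False even on
# the last response, so this loop never observes a final marker.
while True:
    result, error = result_queue.get(timeout=60)
    if error is not None:
        raise error
    response = result.get_response()
    final = response.parameters["triton_final_response"].bool_param
    print(result.as_numpy("text_output"), "final:", final)
    if final:
        break

client.stop_stream()
```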
Expected behavior
parameters { key: "triton_final_response" value { bool_param: true } }
Actual behavior
parameters { key: "triton_final_response" value { bool_param: false } }
Additional notes
I replaced line 102 in tensorrt_llm_bls/1/model.py with:

```python
last_token = ""
response_sender.send(
    pb_utils.InferenceResponse(output_tensors=[
        pb_utils.Tensor(
            'text_output',
            numpy.array([last_token.encode('utf-8')],
                        dtype=numpy.object_))
    ]),
    flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```
If no InferenceResponse is sent with the flag, the final response doesn't seem to be sent to the client.
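For context, a minimal sketch (a hypothetical helper, not the stock model code) contrasting the two ways a decoupled Python backend model can signal the last response; per the observation above, only the first variant results in the client receiving a response that carries the final marker in this setup:

```python
import numpy
import triton_python_backend_utils as pb_utils


def finish_stream(response_sender, attach_to_response=True):
    """Hypothetical helper: end a decoupled response stream."""
    if attach_to_response:
        # Variant 1 (the workaround above): attach the FINAL flag to an
        # explicit, empty last InferenceResponse.
        last_token = ""
        response_sender.send(
            pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor(
                    "text_output",
                    numpy.array([last_token.encode("utf-8")],
                                dtype=numpy.object_))
            ]),
            flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
    else:
        # Variant 2: send the FINAL flag on its own, without a response;
        # in this report the flag never reaches the gRPC client this way.
        response_sender.send(
            flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```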