tensorrtllm_backend
Decoupled BLS model: flag TRITONSERVER_RESPONSE_COMPLETE_FINAL doesn't seem to be sent with the last response delivered to the gRPC client for an async stream request
System Info
- CPU Architecture: x86_64
- GPU: 4x L40S
- CUDA version: 12.4
- TensorRT-LLM: 0.11.0.dev2024052800
- Triton Server version: 2.45.0
- Triton release: 24.04
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
1. Call the BLS model in decoupled mode with an async streaming inference request from the gRPC client.
2. Inspect the value of the `triton_final_response` parameter in the last response received by the gRPC client.
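For reference, a minimal client-side reproduction sketch, assuming a server at `localhost:8001` and the default `tensorrt_llm_bls` input tensors `text_input`/`max_tokens` (the exact input names, shapes, and required tensors depend on your model config):

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def callback(result_queue, result, error):
    # The stream handler pushes every streamed response (or error) here.
    result_queue.put((result, error))


result_queue = queue.Queue()
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Input names/shapes assumed from a typical BLS config; adjust to yours.
text_input = grpcclient.InferInput("text_input", [1, 1], "BYTES")
text_input.set_data_from_numpy(np.array([["Hello"]], dtype=np.object_))
max_tokens = grpcclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(np.array([[16]], dtype=np.int32))

client.start_stream(callback=partial(callback, result_queue))
client.async_stream_infer(model_name="tensorrt_llm_bls",
                          inputs=[text_input, max_tokens])

# Drain the stream and print the triton_final_response parameter of each
# response. With the behavior reported here, bool_param stays False even on
# the last response, so this loop never observes a final marker.
while True:
    result, error = result_queue.get(timeout=60)
    if error is not None:
        raise error
    response = result.get_response()
    final = response.parameters["triton_final_response"].bool_param
    print(result.as_numpy("text_output"), "final:", final)
    if final:
        break

client.stop_stream()
```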
Expected behavior
parameters { key: "triton_final_response" value { bool_param: true } }
Actual behavior
parameters { key: "triton_final_response" value { bool_param: false } }
Additional notes
I replaced line 102 in tensorrt_llm_bls/1/model.py with:

```python
last_token = ""
response_sender.send(
    pb_utils.InferenceResponse(output_tensors=[
        pb_utils.Tensor(
            'text_output',
            numpy.array([last_token.encode('utf-8')],
                        dtype=numpy.object_))
    ]),
    flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```
If no InferenceResponse is sent with the flag, the final response doesn't seem to be sent to the client.
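For context, a minimal sketch (a hypothetical helper, not the stock model code) contrasting the two ways a decoupled Python backend model can signal the last response; per the observation above, only the first variant results in the client receiving a response that carries the final marker in this setup:

```python
import numpy
import triton_python_backend_utils as pb_utils


def finish_stream(response_sender, attach_to_response=True):
    """Hypothetical helper: end a decoupled response stream."""
    if attach_to_response:
        # Variant 1 (the workaround above): attach the FINAL flag to an
        # explicit, empty last InferenceResponse.
        last_token = ""
        response_sender.send(
            pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor(
                    "text_output",
                    numpy.array([last_token.encode("utf-8")],
                                dtype=numpy.object_))
            ]),
            flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
    else:
        # Variant 2: send the FINAL flag on its own, without a response;
        # in this report the flag never reaches the gRPC client this way.
        response_sender.send(
            flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```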