
Decoupled BLS model: Flag TRITONSERVER_RESPONSE_COMPLETE_FINAL doesn't seem to be sent in the last response sent to the grpc client for async stream request

Open Ace-RR opened this issue 1 year ago • 0 comments

System Info

  • CPU Architecture: x86_64
  • GPU: 4x L40S
  • Cuda version: 12.4
  • TensorRT-LLM: 0.11.0.dev2024052800
  • Triton Server version 2.45.0
  • Triton release : 24.04

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Call the BLS model in decoupled mode with an async streaming inference request from the gRPC client.

Inspect the value of the `triton_final_response` parameter in the last response sent to the gRPC client.
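To reproduce the check on the client side, a minimal streaming-client sketch can look like the following. The model name `tensorrt_llm_bls` and the input tensor names/shapes (`text_input`, `max_tokens`, `stream`) are assumptions based on the default TensorRT-LLM BLS example and may need adjusting for a given deployment:

```python
# Hypothetical client sketch: stream one request and inspect whether the last
# response carries the triton_final_response parameter set to True.
import queue


def is_final_response(parameters):
    """Return True if a response's parameters map marks it as final.

    `parameters` is the `parameters` field of a ModelInferResponse, i.e. a
    mapping of name -> InferParameter exposing a `bool_param` attribute.
    """
    param = parameters.get("triton_final_response")
    return param is not None and bool(param.bool_param)


def stream_and_check(url="localhost:8001", model="tensorrt_llm_bls"):
    """Send one streaming request and print the flag on the last response.

    Requires `tritonclient[grpc]`; imported lazily so that the helper above
    stays importable without it.
    """
    import numpy as np
    import tritonclient.grpc as grpcclient

    results = queue.Queue()

    def callback(result, error):
        results.put(error if error is not None else result)

    client = grpcclient.InferenceServerClient(url)
    try:
        # Input names, datatypes, and shapes assumed from the default BLS
        # model config; adjust them for your deployment.
        text = grpcclient.InferInput("text_input", [1], "BYTES")
        text.set_data_from_numpy(np.array([b"What is Triton?"], dtype=np.object_))
        max_tokens = grpcclient.InferInput("max_tokens", [1], "INT32")
        max_tokens.set_data_from_numpy(np.array([16], dtype=np.int32))
        stream_flag = grpcclient.InferInput("stream", [1], "BOOL")
        stream_flag.set_data_from_numpy(np.array([True], dtype=bool))

        client.start_stream(callback=callback)
        client.async_stream_infer(model, [text, max_tokens, stream_flag])
        client.stop_stream()  # blocks until all responses are delivered
    finally:
        client.close()

    last = None
    while not results.empty():
        last = results.get()
    print("final flag on last response:",
          is_final_response(last.get_response().parameters))
```

With the bug described here, `is_final_response` returns `False` even for the very last response of the stream.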

Expected behavior

parameters { key: "triton_final_response" value { bool_param: true } }

actual behavior

parameters { key: "triton_final_response" value { bool_param: false } }

additional notes

I replaced line 102 in `tensorrt_llm_bls/1/model.py` with:

```python
last_token = ""
response_sender.send(
    pb_utils.InferenceResponse(
        output_tensors=[
            pb_utils.Tensor(
                'text_output',
                numpy.array([last_token.encode('utf-8')],
                            dtype=numpy.object_))
        ]),
    flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```

If no `InferenceResponse` is sent with the flag, the final response does not seem to reach the client.
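The pattern behind the workaround can be sketched in isolation. This is not the actual BLS source, just an illustration of the send loop of a decoupled Python-backend model, with the constant replaced by a stand-in integer:

```python
# Illustrative sketch: in a decoupled Python-backend model, the FINAL flag
# only reaches the client when it is attached to a send() call.
TRITONSERVER_RESPONSE_COMPLETE_FINAL = 1  # stand-in for the pb_utils constant


def send_tokens(response_sender, make_response, tokens):
    """Stream `tokens`, then attach the FINAL flag to a trailing empty response.

    `response_sender` mimics the pb_utils sender: send(response=None, flags=0).
    `make_response` builds a response object from a token string.
    """
    for token in tokens:
        response_sender.send(make_response(token))
    # Without this explicit send, the client never receives a response whose
    # triton_final_response parameter is True. Alternatively, the flag can be
    # attached to the last data-bearing response instead of an empty one.
    response_sender.send(make_response(""),
                         flags=TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```

Either variant works; the essential point is that the flag must accompany some `send()` call, otherwise the stream is closed without a final-marked response.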

Ace-RR · Jul 31 '24 14:07