fastertransformer_backend
How to terminate a gRPC streaming request immediately during Triton server inference with the FasterTransformer backend?
In a production environment like ChatGPT, terminating a conversation early in response to a user-client command can be a major requirement. Can a gRPC streaming request be terminated immediately while Triton server is running inference with the FasterTransformer backend? Could you please give some advice?
from functools import partial
import tritonclient.grpc as grpcclient

with grpcclient.InferenceServerClient(self.model_url) as client:
    client.start_stream(callback=partial(stream_callback, result_queue))
    client.async_stream_infer(self.model_name, request_data)
Also, does `async_stream_infer` require the inputs to be packaged in some particular format?
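For reference, here is a minimal sketch of the client-side pattern implied by the snippet above: the stream callback pushes results onto a queue, and the consumer loop watches a stop signal so it can break out early and tear down the stream (e.g. by calling `client.stop_stream()`). The stream itself is simulated here with a plain queue rather than a real Triton server, and `on_stop` stands in for whatever teardown call your client version supports; note that stopping the client stream does not by itself guarantee the server stops computing.

```python
import queue
import threading

def stream_callback(result_queue, result, error):
    # Mirrors the tritonclient streaming callback signature:
    # push each streamed result (or error) onto a queue so the
    # main thread can consume it.
    result_queue.put(error if error is not None else result)

def consume_stream(result_queue, stop_event, on_stop):
    # Drain streamed results until either the user requests
    # termination (stop_event) or the stream ends. End-of-stream
    # is signaled here with a None sentinel; a real client would
    # detect the final response flag instead.
    received = []
    while True:
        if stop_event.is_set():
            on_stop()  # e.g. client.stop_stream() in the real client
            break
        item = result_queue.get()
        if item is None:  # end-of-stream sentinel
            break
        received.append(item)
    return received
```

With this structure, the user-facing "stop" command only needs to set `stop_event`; the consumer loop then exits on its next iteration and invokes the teardown callback, discarding any tokens still in flight.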