Allow non-decoupled model to send response and FINAL flag separately
Follow-up to https://github.com/triton-inference-server/core/pull/229. In a custom backend, one may send the response and the FINAL flag in the following style, even for a "non-decoupled" model:
```c
TRITONBACKEND_ResponseSend(response, 0 /* flags */, nullptr /* success */);
TRITONBACKEND_ResponseFactorySendFlags(factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL);
```
From the client's perspective this yields a single response to the request, so a successful inference through HTTP / non-streaming GRPC is expected, as the model is "non-decoupled". The previous implementation was overly restrictive: it simply assumed the decoupled use case whenever the response-complete callback was invoked more than once. This change relaxes that restriction and collapses the two invocations above into a single response to the client.
@GuanLuo do we still plan to implement this, to help unify the decoupled/non-decoupled logic in backends like RIVA/TRTLLM, etc.?
@rmccorm4 , I've created a ticket for this: DLIS-6308