
Allow non-decoupled model to send response and FINAL flag separately

Open GuanLuo opened this issue 1 year ago • 1 comment

Follow-up on https://github.com/triton-inference-server/core/pull/229. In a custom backend, one may send a response in the following style, even for a "non-decoupled" model:

TRITONBACKEND_ResponseSend(response, 0, nullptr /* success */);
TRITONBACKEND_ResponseFactorySendFlags(factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL);

This yields a single response for the request from the client's perspective, so the client would expect a successful inference through HTTP / non-streaming GRPC since the model is "non-decoupled". The previous implementation was restrictive in that it simply assumed a decoupled use case whenever the response complete callback was invoked more than once. The change is to relax that restriction and collapse the above into a single response to the client.
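For illustration, the calling pattern above placed in a fuller backend context might look like the sketch below. This is a minimal sketch assuming the standard tritonbackend.h response-factory API; output tensor population and error checking are omitted, and HandleRequest is a hypothetical helper name, not part of the backend API.

// Minimal sketch of sending a response and the FINAL flag separately
// for a non-decoupled model. Error checking and output population omitted.
#include "triton/core/tritonbackend.h"

static void
HandleRequest(TRITONBACKEND_Request* request)
{
  // Create a response factory so the FINAL flag can be sent separately
  // from the response itself.
  TRITONBACKEND_ResponseFactory* factory = nullptr;
  TRITONBACKEND_ResponseFactoryNew(&factory, request);

  // Create and send the single data-carrying response (send_flags == 0).
  TRITONBACKEND_Response* response = nullptr;
  TRITONBACKEND_ResponseNewFromFactory(&response, factory);
  // ... populate output tensors on 'response' here ...
  TRITONBACKEND_ResponseSend(response, 0, nullptr /* success */);

  // Send the FINAL flag on its own, with no response attached. With the
  // relaxed restriction described above, the frontend collapses these two
  // callback invocations into a single response for a non-decoupled model.
  TRITONBACKEND_ResponseFactorySendFlags(
      factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL);

  TRITONBACKEND_ResponseFactoryDelete(factory);

  // Release the request as usual once the backend is done with it.
  TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL);
}

From the client side nothing should change: HTTP and non-streaming GRPC still see one successful inference response.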

GuanLuo · Jul 04 '23

@GuanLuo, do we still plan to implement this to help unify the decoupled/non-decoupled logic in backends like RIVA/TRTLLM, etc.?

rmccorm4 · Feb 09 '24

@rmccorm4, I've created a ticket for this: DLIS-6308

oandreeva-nv · Mar 08 '24