fastertransformer_backend

Can I stop execution? (w/ `decoupled mode`)

Open Yeom opened this issue 1 year ago • 1 comments

Description

Docker: nvcr.io/nvidia/tritonserver:23.04-py3
GPU: A100

How can I stop bidirectional streaming (decoupled mode)?
- I want to stop model inference (the streaming response) when the user disconnects or when certain conditions are met, but I don't currently know how to do that.


Reference
- https://github.com/triton-inference-server/server/issues/4344
- https://github.com/triton-inference-server/server/issues/5833#issuecomment-1561318646

Reproduced Steps

-

Yeom avatar Aug 21 '23 00:08 Yeom

I've hit a similar problem. If the FT server encounters a stop token during generation while the number of tokens generated so far is still less than max_new_tokens, the server keeps replying with the same result and does not stop the stream.

client.stop_stream() is called, but it blocks until the result's length equals max_new_tokens.

Is there any way out of this?
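A client-side workaround might look like the sketch below. It does not stop generation on the server; it only lets the caller return as soon as the stop token shows up or the server starts repeating itself. Everything here is an assumption for illustration: `StreamCollector`, `simulate_server`, and `STOP_TOKEN_ID` are hypothetical names, each response is assumed to be already decoded into a cumulative list of token ids, and the `(result, error)` callback shape mirrors what the streaming client passes to its callback.

```python
import threading

STOP_TOKEN_ID = 2  # hypothetical end-of-sequence id; depends on the tokenizer


class StreamCollector:
    """Tracks the latest streamed output and flags when generation has ended."""

    def __init__(self, stop_token_id=STOP_TOKEN_ID):
        self.stop_token_id = stop_token_id
        self.latest = []                # most recent output, stop token trimmed
        self.error = None
        self.done = threading.Event()   # set once the caller can stop waiting
        self._last_raw = None

    def callback(self, result, error):
        # Ignore anything the server keeps sending after we decided to stop.
        if self.done.is_set():
            return
        if error is not None:
            self.error = error
            self.done.set()
            return
        tokens = list(result)
        # A response identical to the previous one is treated as "finished".
        if tokens == self._last_raw:
            self.done.set()
            return
        self._last_raw = tokens
        if self.stop_token_id in tokens:
            # Keep only the tokens before the stop token, then finish.
            self.latest = tokens[: tokens.index(self.stop_token_id)]
            self.done.set()
            return
        self.latest = tokens


def simulate_server(callback, max_new_tokens=8):
    """Stand-in for the server: emits cumulative output and keeps replying
    with the same result after the stop token, up to max_new_tokens."""
    generated = [11, 12, 13, STOP_TOKEN_ID]
    for step in range(1, max_new_tokens + 1):
        callback(generated[: min(step, len(generated))], None)


if __name__ == "__main__":
    collector = StreamCollector()
    simulate_server(collector.callback)
    collector.done.wait(timeout=1.0)
    print(collector.latest)
```

With a real client, `collector.callback` would be passed as the stream callback after decoding each response into token ids, and `client.stop_stream()` could be called from a background thread once `collector.done` is set, so the blocking call does not hold up the caller.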

shanekong avatar Sep 12 '23 08:09 shanekong