fastertransformer_backend
Can I stop execution? (with `decoupled mode`)
Description
Docker: nvcr.io/nvidia/tritonserver:23.04-py3
Gpu: A100
How can I stop bidirectional streaming (decoupled mode)?
- I want to stop model inference (the streaming response) when the user disconnects or when certain conditions are met, but I don't know how to do that at the moment.
Reference
- https://github.com/triton-inference-server/server/issues/4344
- https://github.com/triton-inference-server/server/issues/5833#issuecomment-1561318646
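In the absence of server-side cancellation, one client-side pattern is a "soft stop": keep a flag next to the streaming callback and drop any partial results that arrive after the user disconnects. This is a minimal sketch of that pattern; the `StreamConsumer` class is hypothetical, and it assumes the Triton streaming callback signature `(result, error)` used by `tritonclient.grpc`:

```python
# Hypothetical sketch: client-side "soft stop" for a decoupled stream.
# The server may keep generating, but the client stops consuming.
import queue
import threading


class StreamConsumer:
    """Collects streamed results until stop() is called; later results are dropped."""

    def __init__(self):
        self._stopped = threading.Event()
        self.results = queue.Queue()

    def callback(self, result, error):
        # Matches the (result, error) callback shape of tritonclient's
        # gRPC streaming API; drop everything once stop() was called.
        if self._stopped.is_set():
            return
        self.results.put(error if error is not None else result)

    def stop(self):
        self._stopped.set()
```

With `tritonclient.grpc`, you would pass `consumer.callback` to `client.start_stream(...)`, and on disconnect call `consumer.stop()` followed by `client.stop_stream()`. Note this only stops the client from consuming; it does not by itself abort generation on the server.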
Reproduced Steps
-
I'm hitting a similar problem. If the FT server encounters a stop token during generation while the number of tokens generated so far is still shorter than `max_new_tokens`, it keeps replying with the same result instead of stopping the stream.
`client.stop_stream()` is called, but it blocks until the result's length equals `max_new_tokens`.
Is there any way to get out of this?
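One workaround for the blocking behavior described above is to call `stop_stream()` from a background thread with a timeout, so the caller is not stuck waiting until the server finishes generating `max_new_tokens`. This is a sketch under the assumption that `client` exposes a blocking `stop_stream()` like `tritonclient.grpc.InferenceServerClient`; the helper name is hypothetical:

```python
# Hypothetical workaround: don't let a blocking stop_stream() stall the
# caller. We run it in a daemon thread and give up after a timeout;
# the stream still winds down in the background.
import threading


def stop_stream_with_timeout(client, timeout_s=5.0):
    """Return True if client.stop_stream() finished within timeout_s, else False."""
    done = threading.Event()

    def _worker():
        try:
            client.stop_stream()
        finally:
            done.set()

    threading.Thread(target=_worker, daemon=True).start()
    return done.wait(timeout_s)
```

This does not cancel generation on the server either; it only bounds how long the client waits before moving on.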