fastertransformer_backend
Can I stop execution? (with `decoupled mode`)
Description
Docker: nvcr.io/nvidia/tritonserver:23.04-py3
Gpu: A100
How can I stop bidirectional streaming (decoupled mode)?
- I want to stop model inference (the streaming response) when the user disconnects or when certain conditions are met, but I don't know how to do that at the moment.
Reference
- https://github.com/triton-inference-server/server/issues/4344
- https://github.com/triton-inference-server/server/issues/5833#issuecomment-1561318646
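In the absence of server-side cancellation, one client-side pattern is a "soft stop": keep a flag next to the streaming callback and drop any partial results that arrive after the user disconnects. This is a minimal sketch of that pattern; the `StreamConsumer` class is hypothetical, and it assumes the Triton streaming callback signature `(result, error)` used by `tritonclient.grpc`:

```python
# Hypothetical sketch: client-side "soft stop" for a decoupled stream.
# The server may keep generating, but the client stops consuming.
import queue
import threading


class StreamConsumer:
    """Collects streamed results until stop() is called; later results are dropped."""

    def __init__(self):
        self._stopped = threading.Event()
        self.results = queue.Queue()

    def callback(self, result, error):
        # Matches the (result, error) callback shape of tritonclient's
        # gRPC streaming API; drop everything once stop() was called.
        if self._stopped.is_set():
            return
        self.results.put(error if error is not None else result)

    def stop(self):
        self._stopped.set()
```

With `tritonclient.grpc`, you would pass `consumer.callback` to `client.start_stream(...)`, and on disconnect call `consumer.stop()` followed by `client.stop_stream()`. Note this only stops the client from consuming; it does not by itself abort generation on the server.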
Reproduced Steps
-
I'm hitting a similar problem. If the FT server encounters a stop token during generation while the number of tokens generated so far is still shorter than `max_new_tokens`, it keeps replying with the same result instead of stopping the stream.
`client.stop_stream()` is called, but it blocks until the result's length equals `max_new_tokens`.
Is there any way to get out of this?
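One workaround for the blocking behavior described above is to call `stop_stream()` from a background thread with a timeout, so the caller is not stuck waiting until the server finishes generating `max_new_tokens`. This is a sketch under the assumption that `client` exposes a blocking `stop_stream()` like `tritonclient.grpc.InferenceServerClient`; the helper name is hypothetical:

```python
# Hypothetical workaround: don't let a blocking stop_stream() stall the
# caller. We run it in a daemon thread and give up after a timeout;
# the stream still winds down in the background.
import threading


def stop_stream_with_timeout(client, timeout_s=5.0):
    """Return True if client.stop_stream() finished within timeout_s, else False."""
    done = threading.Event()

    def _worker():
        try:
            client.stop_stream()
        finally:
            done.set()

    threading.Thread(target=_worker, daemon=True).start()
    return done.wait(timeout_s)
```

This does not cancel generation on the server either; it only bounds how long the client waits before moving on.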