
[question] How to properly handle client request cancellation during inference?


Hey all,

My model's inference is quite long-running (around 50 seconds per request), so it would be great if closed client connections were handled properly by interrupting the inference that's currently in progress. I'm currently implementing the initialize, preprocess, inference and postprocess methods in my custom handler class. What's the proper place to detect a closed connection, if that's possible?
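
For reference, a simplified sketch of such a handler (the method names follow the standard TorchServe BaseHandler API; the class name, input decoding, and model call are only illustrative):

import torch
from ts.torch_handler.base_handler import BaseHandler

class MyLongRunningHandler(BaseHandler):
    def initialize(self, context):
        # BaseHandler loads the serialized model and selects the device.
        super().initialize(context)

    def preprocess(self, data):
        # Each batch entry carries its payload under "data" or "body";
        # decoding it into model inputs is model-specific (illustrative here).
        return [row.get("data") or row.get("body") for row in data]

    def inference(self, data, *args, **kwargs):
        # The long-running step (~50 seconds per request in my case).
        with torch.no_grad():
            return [self.model(item) for item in data]

    def postprocess(self, inference_output):
        # TorchServe expects one response entry per batch item.
        return [str(out) for out in inference_output]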

Thanks, Miro

miroslavLalev avatar Nov 30 '23 18:11 miroslavLalev

@miroslavLalev There are two model-level configuration parameters to address long-running inference requests.

  • responseTimeout: this parameter keeps the TorchServe frontend from disconnecting from the backend worker (i.e. the model handler) while a long inference is still running.
  • clientTimeoutInMills: when the client connection times out, TorchServe will either skip processing the request if it is still pending in the frontend queue, or stop sending the response to the client if the response has already been received from the backend worker.

Both of the parameters can be set in model-config.yaml.
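
For example, a model-config.yaml along these lines (the two timeout parameters are the ones described above; the worker settings and the specific values are only illustrative):

# model-config.yaml (illustrative values)
minWorkers: 1
maxWorkers: 1
responseTimeout: 120          # seconds; keep it above the worst-case inference time
clientTimeoutInMills: 60000   # milliseconds; work for timed-out clients is skipped/dropped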

lxning avatar Dec 01 '23 08:12 lxning

@miroslavLalev I tried responseTimeout=5 (my model's inference time is 10s). After calling the TorchServe inference endpoint I found logs like this:

2024-03-13T23:40:02,848 [ERROR] W-9004-bert4rec_240314-083734 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1
2024-03-13T23:40:02,857 [ERROR] W-9004-bert4rec_240314-083734 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:230) [model-server.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
2024-03-13T23:40:02,966 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9004 Worker disconnected. WORKER_MODEL_LOADED

...

2024-03-13T23:40:02,971 [INFO ] W-9004-bert4rec_240314-083734 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1710373202971

But auto recovery fails again and again:

2024-03-13T23:41:05,887 [WARN ] W-9004-bert4rec_240314-083734 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again

Is this a normal situation or a bug? Do you know how to fix it?

gukwonku avatar Mar 13 '24 23:03 gukwonku

@gukwonku please set your responseTimeout to be greater than your model's inference time.
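
For example, with a ~10s inference, something like this in model-config.yaml (the value is illustrative; it just needs headroom above the worst case):

responseTimeout: 30   # seconds; comfortably above the ~10s inference time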

Also, the worker recovery issue has been fixed and is included in the latest release: https://github.com/pytorch/serve/releases/tag/v0.10.0

namannandan avatar Mar 20 '24 22:03 namannandan