
Handle invalid JSON-formatted requests without killing the worker process

Open duk0011 opened this issue 3 years ago • 1 comment

🚀 The feature

Current Setup: Currently, the TorchServe backend worker process dies whenever it receives an invalid JSON-formatted request.

Feature Requested: Instead of killing the backend worker process, TorchServe could handle this failure gracefully within a try/except block, log the failing payload, and send back a relevant error code (maybe 400?).

Test Query: "Blue shirt!
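
For context, the failure can be reproduced by sending a body that is not valid JSON, such as a string literal containing a raw newline. The sketch below is illustrative only: the model name el1001 is taken from the worker log further down, and the default inference port 8080 is assumed.

```python
import requests

# Hypothetical reproduction: the body below is not valid JSON (an unterminated
# string containing a raw newline), so the backend worker's json.loads() call
# raises JSONDecodeError and the worker process dies.
resp = requests.post(
    "http://localhost:8080/predictions/el1001",  # model name assumed from the log below
    data='"Blue shirt!\n',                        # invalid JSON payload
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)
```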

Point of failure within Torchserve: https://github.com/pytorch/serve/blob/master/ts/protocol/otf_message_handler.py#L314
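
One possible shape for the fix, shown here as a minimal sketch rather than the actual TorchServe implementation (the function name and the returned dict shape are illustrative, not the real protocol code), would be to catch the decode error where the payload is parsed, log the offending bytes, and return an error marker that the frontend could translate into a 400 response instead of letting the exception escape and kill the worker:

```python
import json
import logging

logger = logging.getLogger(__name__)

def _decode_json_input(value: bytes) -> dict:
    """Sketch: decode a JSON request body without raising on bad input.

    Instead of letting json.JSONDecodeError propagate (which currently
    terminates the worker process), log the failing payload and return
    an error marker that upstream code could turn into a 400 response.
    """
    try:
        return {"value": json.loads(value.decode("utf-8"))}
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        logger.error("Invalid JSON request payload: %r (%s)", value[:200], exc)
        return {"value": None, "error": f"Invalid JSON input: {exc}"}
```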

Failure Traceback:

2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - Backend worker process died.
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/ts/model_service_worker.py", line 210, in <module>
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - worker.run_server()
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/ts/model_service_worker.py", line 181, in run_server
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/ts/model_service_worker.py", line 132, in handle_connection
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - cmd, msg = retrieve_msg(cl_socket)
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/ts/protocol/otf_message_handler.py", line 36, in retrieve_msg
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - msg = _retrieve_inference_msg(conn)
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/ts/protocol/otf_message_handler.py", line 226, in _retrieve_inference_msg
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - request = _retrieve_request(conn)
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/ts/protocol/otf_message_handler.py", line 261, in _retrieve_request
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - input_data = _retrieve_input_data(conn)
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.8/site-packages/ts/protocol/otf_message_handler.py", line 314, in _retrieve_input_data
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - model_input["value"] = json.loads(value.decode("utf-8"))
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - return _default_decoder.decode(s)
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - obj, end = self.raw_decode(s, idx=_w(s, 0).end())
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - obj, end = self.scan_once(s, idx)
2022-07-27T15:30:41,172 [INFO ] W-9000-el1001_1.0-stdout MODEL_LOG - json.decoder.JSONDecodeError: Invalid control character at: line 2 column 28 (char 29)
2022-07-27T15:30:41,567 [ERROR] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - Unknown exception io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2022-07-27T15:30:41,567 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED

Motivation, pitch

Once the worker dies, it takes some time to restart, and any valid requests that arrive during that window are impacted.
This may affect our production services that use TorchServe.

Alternatives

No response

Additional context

No response

duk0011 avatar Aug 01 '22 03:08 duk0011

@duk0011 Thank you for the report. I have added this to the TS backlog.

lxning avatar Aug 02 '22 17:08 lxning