TorchServe - Worker stopped
Bug Description
A few months ago I was using MMDetection 2.28.2 and deploying my models with TorchServe, and it worked perfectly. In recent weeks the project I work on decided to update MMDetection to the latest version (3.3.0) and keep deploying with TorchServe, updating it if necessary. To validate the updated training and deployment workflow, I retrained object detection models with the same architectures I had previously deployed on the earlier MMDetection version. Training was successful and I can run predictions on images with the init_detector API; however, I run into issues when deploying with TorchServe.
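To be clear, the retrained checkpoints themselves are fine: this is roughly how I verify them locally with the MMDetection 3.x API (the config, checkpoint, and image paths below are just placeholders for my local files):

```python
from mmdet.apis import init_detector, inference_detector

# Placeholder paths for my local config, checkpoint, and test image
config_file = 'configs/swin/my_swin_config.py'
checkpoint_file = 'work_dirs/my_swin/latest.pth'

model = init_detector(config_file, checkpoint_file, device='cuda:0')
result = inference_detector(model, 'demo.jpg')  # a DetDataSample in MMDetection 3.x

# Predicted boxes and scores come back under pred_instances
print(result.pred_instances.bboxes)
print(result.pred_instances.scores)
```

This works as expected, so the problem only shows up at the TorchServe deployment step.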
I'm following the documentation for the deployment attempt: [model serving](https://mmdetection.readthedocs.io/en/v2.19.1/useful_tools.html#build-mmdet-serve-docker-image).
Reproduction
- Clone the mmdetection GitHub repository and install it.
- Generate a ".mar" file.
python tools/deployment/mmdet2torchserve.py ${CONFIG_FILE} ${CHECKPOINT_FILE} \
--output-folder ${MODEL_STORE} \
--model-name ${MODEL_NAME}
- Build the mmdet-serve Docker image.
docker build -t mmdet-serve:latest docker/serve/
- Run mmdet-serve.
docker run --rm \
--cpus 8 \
--gpus device=0 \
-p8080:8080 -p8081:8081 -p8082:8082 \
--mount type=bind,source=$MODEL_STORE,target=/home/model-server/model-store \
mmdet-serve:latest
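With the container running, I send prediction requests to the TorchServe inference API roughly like this (the model name "swin" is the --model-name I used for the .mar file, and demo.jpg is just a placeholder test image):

```python
import requests

# TorchServe inference API: POST the raw image bytes to /predictions/<model-name>
url = 'http://127.0.0.1:8080/predictions/swin'

with open('demo.jpg', 'rb') as image:
    response = requests.post(url, data=image)

print(response.status_code)
print(response.text)  # expected: JSON detection results from the mmdet handler
```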
Environment
Docker (docker/serve/Dockerfile):
ARG PYTORCH="1.9.0"
ARG CUDA="11.1"
ARG CUDNN="8"
FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel
ARG MMCV="2.0.0rc4"
ARG MMDET="3.3.0"
MMDetection: 3.3.0
OS: Ubuntu 22.04
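The ARG values above come from the serve Dockerfile; inside the container I also double-checked the actually installed versions, roughly like this:

```python
import torch
import mmcv
import mmengine
import mmdet

# Report the versions actually installed in the serve container
print('torch    :', torch.__version__, '| CUDA', torch.version.cuda)
print('mmcv     :', mmcv.__version__)
print('mmengine :', mmengine.__version__)
print('mmdet    :', mmdet.__version__)
```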
Error traceback
I can't make predictions at the model endpoint served with TorchServe, and when I checked the logs, I got the following error:
2024-08-21 11:49:02 2024-08-21T14:49:02,797 [INFO ] W-9002-swin_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9002 in 89 seconds.
2024-08-21 11:49:02 2024-08-21T14:49:02,799 [WARN ] W-9004-swin_1.0-stderr MODEL_LOG - Traceback (most recent call last):
2024-08-21 11:49:02 2024-08-21T14:49:02,799 [WARN ] W-9004-swin_1.0-stderr MODEL_LOG - File "/opt/conda/lib/python3.7/site-packages/ts/model_service_worker.py", line 15, in <module>
2024-08-21 11:49:02 2024-08-21T14:49:02,800 [WARN ] W-9004-swin_1.0-stderr MODEL_LOG - from ts.async_service import AsyncService
2024-08-21 11:49:02 2024-08-21T14:49:02,800 [WARN ] W-9004-swin_1.0-stderr MODEL_LOG - File "<fstring>", line 1
2024-08-21 11:49:02 2024-08-21T14:49:02,800 [WARN ] W-9004-swin_1.0-stderr MODEL_LOG - (self._entry_point=)
2024-08-21 11:49:02 2024-08-21T14:49:02,800 [WARN ] W-9004-swin_1.0-stderr MODEL_LOG - ^
2024-08-21 11:49:02 2024-08-21T14:49:02,800 [WARN ] W-9004-swin_1.0-stderr MODEL_LOG - SyntaxError: invalid syntax
2024-08-21 11:49:02 2024-08-21T14:49:02,802 [WARN ] W-9000-swin_1.0-stderr MODEL_LOG - Traceback (most recent call last):
2024-08-21 11:49:02 2024-08-21T14:49:02,803 [WARN ] W-9000-swin_1.0-stderr MODEL_LOG - File "/opt/conda/lib/python3.7/site-packages/ts/model_service_worker.py", line 15, in <module>
2024-08-21 11:49:02 2024-08-21T14:49:02,803 [WARN ] W-9000-swin_1.0-stderr MODEL_LOG - from ts.async_service import AsyncService
2024-08-21 11:49:02 2024-08-21T14:49:02,803 [WARN ] W-9000-swin_1.0-stderr MODEL_LOG - File "<fstring>", line 1
2024-08-21 11:49:02 2024-08-21T14:49:02,803 [WARN ] W-9000-swin_1.0-stderr MODEL_LOG - (self._entry_point=)
2024-08-21 11:49:02 2024-08-21T14:49:02,803 [WARN ] W-9000-swin_1.0-stderr MODEL_LOG - ^
2024-08-21 11:49:02 2024-08-21T14:49:02,803 [WARN ] W-9000-swin_1.0-stderr MODEL_LOG - SyntaxError: invalid syntax
2024-08-21 11:49:02 2024-08-21T14:49:02,808 [INFO ] W-9004-swin_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9004-swin_1.0-stderr
2024-08-21 11:49:02 2024-08-21T14:49:02,808 [INFO ] W-9004-swin_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9004-swin_1.0-stdout
2024-08-21 11:49:02 2024-08-21T14:49:02,808 [ERROR] W-9004-swin_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
2024-08-21 11:49:02 org.pytorch.serve.wlm.WorkerInitializationException: Backend stream closed.
2024-08-21 11:49:02 at org.pytorch.serve.wlm.WorkerLifeCycle.startWorkerPython(WorkerLifeCycle.java:204) ~[model-server.jar:?]
2024-08-21 11:49:02 at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:106) ~[model-server.jar:?]
2024-08-21 11:49:02 at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:362) ~[model-server.jar:?]
2024-08-21 11:49:02 at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:190) [model-server.jar:?]
2024-08-21 11:49:02 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
2024-08-21 11:49:02 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
2024-08-21 11:49:02 at java.lang.Thread.run(Thread.java:829) [?:?]
2024-08-21 11:49:02 2024-08-21T14:49:02,808 [DEBUG] W-9004-swin_1.0 org.pytorch.serve.wlm.WorkerThread - W-9004-swin_1.0 State change WORKER_STOPPED -> WORKER_STOPPED
2024-08-21 11:49:02 2024-08-21T14:49:02,809 [WARN ] W-9004-swin_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
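For what it's worth, the failing line looks like the f-string "=" debugging specifier (something like f"{self._entry_point=}"), which only exists since Python 3.8, while the traceback shows the worker running under /opt/conda/lib/python3.7. My guess is that the newer TorchServe code installed in the image is incompatible with the Python 3.7 that ships with this base image. A minimal snippet that reproduces the same SyntaxError on 3.7:

```python
# The '=' specifier inside an f-string is only valid from Python 3.8 onwards.
# On Python 3.7 the line below fails with:
#   File "<fstring>", line 1 ... SyntaxError: invalid syntax
entry_point = 'mmdet_handler.py'
print(f'{entry_point=}')  # Python 3.8+: prints entry_point='mmdet_handler.py'
```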