TorchServe no prediction when input data gets bigger (Backend worker did not respond in given time)
🐛 Describe the bug
I am sending JSON data with python-requests. For simplicity, you can assume the following input:
dic1 = {"main": "this is a main", "categories": "this is a categories"}
count = 2
input_data = [dic1 for i in range(count)]
response = requests.post(url, json=input_data)
Issue:
When count <= 8 --> it works well.
As soon as count > 8 --> it gets stuck and never returns.
As you can see, the input is just a simple Python dictionary; even if I set input_data = [dic1 for i in range(10)], the final size of the input is very small.
I am using:
- a custom handler
- a trained model based on SimpleTransformers
- Ubuntu 22.04 and 20.04
- GPU (local: RTX 3060, Kubernetes: T4)
- I have tested on local machines and on Kubernetes; the issue is the same
When the issue shows itself:
TorchServe on GPU: it depends critically on the input data size.
When it works well:
TorchServe on CPU: it works well regardless of the input size.
PyTorch without TorchServe: it works well even when I pass input_data = [dic1 for i in range(1000)].
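For reference, the standalone check I ran outside TorchServe looks roughly like this (the model path and input text are simplified placeholders for my real setup):

```python
# Rough sketch of the standalone (no TorchServe) test that works on GPU.
from simpletransformers.classification import ClassificationModel

model = ClassificationModel("roberta", "model_folder", use_cuda=True)
data = ["this is a main" for i in range(1000)]
preds, raw_outputs = model.predict(data)
print(len(preds))
```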
Error logs
2022-08-23 16:34:14,800 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-hardnews_1.0 State change null -> WORKER_STARTED
2022-08-23 16:34:14,804 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2022-08-23 16:34:22,821 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 7931
2022-08-23 16:34:22,821 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-hardnews_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2022-08-23 16:34:48,850 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 1093
2022-08-23 16:34:48,852 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 239159, Backend time ns: 1094847821
2022-08-23 16:34:53,317 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 156
2022-08-23 16:34:53,318 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 144077, Backend time ns: 157878185
2022-08-23 16:35:01,126 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 224
2022-08-23 16:35:01,127 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 140180, Backend time ns: 225271057
2022-08-23 16:35:38,326 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 30000
2022-08-23 16:35:38,326 [ERROR] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1
2022-08-23 16:35:38,327 [ERROR] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:198)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
2022-08-23 16:35:38,328 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
2022-08-23 16:35:38,335 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.job.Job - Waiting time ns: 64717, Inference time ns: 30009467849
2022-08-23 16:35:38,335 [DEBUG] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-hardnews_1.0 State change WORKER_MODEL_LOADED -> WORKER_STOPPED
2022-08-23 16:35:38,335 [WARN ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-hardnews_1.0-stderr
2022-08-23 16:35:38,335 [WARN ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-hardnews_1.0-stdout
2022-08-23 16:35:38,336 [INFO ] W-9000-hardnews_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
Installation instructions
I don't use Docker for installation.
Model Packaging
# Running archiver
torch-model-archiver -f --model-name model \
--version 1.0 \
--serialized-file model_folder/pytorch_model.bin \
--export-path model-store \
--requirements-file requirements.txt \
--extra-files "model_folder/config.json,model_folder/merges.txt,model_folder/model_args.json,model_folder/special_tokens_map.json,model_folder/tokenizer.json,model_folder/tokenizer_config.json,model_folder/training_args.bin,model_folder/vocab.json" \
--handler handler.py
config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
install_py_dep_per_model=true
NUM_WORKERS=1
number_of_gpu=1
number_of_netty_threads=4
netty_client_threads=1
MKL_NUM_THREADS=1
batch_size=1
max_batch_delay=10
job_queue_size=1000
model_store=/home/model-server/shared/model-store
model_snapshot={"name": "startup.cfg","modelCount": 1,"models": {"news": {"1.0": {"defaultVersion": true,"marName": "news.mar","minWorkers": 1,"maxWorkers": 1,"batchSize": 1,"maxBatchDelay": 10,"responseTimeout": 120}}}}
- I have tried different values for the threading settings, but it did not help.
Versions
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch:
torchserve==0.4.0b20210521
torch-model-archiver==0.4.0b20210521
Python version: 3.8 (64-bit runtime)
Python executable: /home/ted/anaconda3/envs/myland/bin/python3
Versions of relevant python libraries:
captum==0.5.0
future==0.18.2
numpy==1.23.1
psutil==5.9.1
pytest==4.6.11
pytest-forked==1.4.0
pytest-timeout==1.4.2
pytest-xdist==1.34.0
requests==2.28.1
requests-mock==1.9.3
requests-oauthlib==1.3.1
sentencepiece==0.1.95
simpletransformers==0.62.0
torch==1.12.1
torch-model-archiver==0.4.0b20210521
torch-workflow-archiver==0.1.0b20210521
torchaudio==0.12.1
torchserve==0.4.0b20210521
torchvision==0.13.1
transformers==4.20.1
wheel==0.37.1
torch==1.12.1
**Warning: torchtext not present ..
torchvision==0.13.1
torchaudio==0.12.1
Java Version:
OS: Ubuntu 22.04 LTS
GCC version: (Ubuntu 11.2.0-19ubuntu1) 11.2.0
Clang version: N/A
CMake version: N/A
Is CUDA available: Yes
CUDA runtime version: N/A
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 515.65.01
cuDNN version: None
Repro instructions
# Running archiver
torch-model-archiver -f --model-name model \
--version 1.0 \
--serialized-file model_folder/pytorch_model.bin \
--export-path model-store \
--requirements-file requirements.txt \
--extra-files "model_folder/config.json,model_folder/merges.txt,model_folder/model_args.json,model_folder/special_tokens_map.json,model_folder/tokenizer.json,model_folder/tokenizer_config.json,model_folder/training_args.bin,model_folder/vocab.json" \
--handler handler.py
torchserve --start --model-store model-store --models model=hardnews --ncs --ts-config config.properties
Running a prediction, roughly as shown below.
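A sketch of the prediction request I run for the repro (assuming the model ends up registered as hardnews, matching the worker name in the logs above, on the default inference port):

```python
import requests

# Prediction request for the repro; count > 8 triggers the hang on GPU.
url = "http://127.0.0.1:8080/predictions/hardnews"
dic1 = {"main": "this is a main", "categories": "this is a categories"}
input_data = [dic1 for i in range(10)]
response = requests.post(url, json=input_data)
print(response.status_code, response.content.decode("utf-8"))
```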
Possible Solution
No response
- The reason for sending a list of inputs instead of a single input is computational performance.
- I have increased responseTimeout, but it does not help.
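For reference, one way to set a larger responseTimeout is to re-register the model through the management API (a sketch; it assumes the default management port 8081, that the previous registration was removed first, and parameter names as documented for the TorchServe management API):

```python
import requests

# Re-register news.mar with a longer response_timeout (in seconds).
resp = requests.post(
    "http://127.0.0.1:8081/models",
    params={
        "url": "news.mar",
        "initial_workers": 1,
        "batch_size": 1,
        "max_batch_delay": 10,
        "response_timeout": 600,
    },
)
print(resp.json())
```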
Taking a look. Will get back to you
@tednaseri Can you please share how you are pre-processing and running inference on the input data? In another use case, I have tried sending a batch of 10 images as JSON data and processing them in a single batch, and this works. So I would need more details on your implementation to repro this. For example, it would be great if you could use the HuggingFace transformer example given in the README to modify the custom handler and see if you are able to repro the problem. That's the example I am going to try.
@agunapal Thank you so much for the response. For easier communication, I have tried to simplify the custom handler while it still repros the problem. For this purpose, I imagine that the input data is just a digit, and the handler builds a dummy input as follows:
def handler(input_number):
    data = ["sample text" for i in range(input_number)]
    return model.predict(data)
Even with this simplified handler, the issue still occurs. Here is the prepared handler:
from abc import ABC
import logging

import torch
import transformers
from simpletransformers.classification import ClassificationModel

from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)
logger.info("Transformers version %s", transformers.__version__)


class TransformersCustomHandler(BaseHandler, ABC):
    def __init__(self):
        super(TransformersCustomHandler, self).__init__()
        self.initialized = False

    def initialize(self, context):
        self.context = context
        self.manifest = context.manifest
        properties = context.system_properties
        self.model_folder = properties.get("model_dir")
        if torch.cuda.is_available() and properties.get("gpu_id") is not None:
            self.device = torch.device("cuda:" + str(properties.get("gpu_id")))
            self.use_cuda = True
        else:
            self.device = torch.device("cpu")
            self.use_cuda = False
        self.predictions = []
        self.labels = ['no', 'yes']
        self.model = self.load_model()
        # The following line does not work for simple transformer
        # self.model.to(self.device)
        # self.model.eval()
        self.initialized = True

    def load_model(self):
        model = ClassificationModel('roberta', self.model_folder, use_cuda=self.use_cuda)
        return model

    def predict(self, param):
        self.predictions = []
        count = param[0]["count"].decode("utf-8")
        count = int(count)
        input_text = "sample text"
        data = [input_text for i in range(count)]
        preds, out_results = self.model.predict(data)
        label_lst = [self.labels[i] for i in preds]
        for i in range(len(label_lst)):
            prediction = {"label": label_lst[i]}
            self.predictions.append(prediction)

    def get_predictions(self):
        return self.predictions

    # _service = TransformersCustomHandler()
    def handle(self, data, context):
        try:
            # if not _service.initialized:
            #     _service.initialize(context)
            #
            # if data is None:
            #     return None
            self.predict(data)
            result = [self.get_predictions()]
            return result
        except Exception as e:
            raise e
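For completeness, this is roughly how I call the simplified handler above; the count is sent as a form field because the handler reads param[0]["count"] as bytes (the endpoint name assumes the hardnews model from my setup):

```python
import requests

# The simplified handler only reads a "count" field and builds dummy inputs
# itself, so the client just posts the count as form data.
url = "http://127.0.0.1:8080/predictions/hardnews"
response = requests.post(url, data={"count": "10"})
print(response.content.decode("utf-8"))
```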
@tednaseri I used the handler below and tried it with JSON payloads of length 1000 on a T4 GPU. It works.
from abc import ABC
import logging

import torch
import transformers
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)
logger.info("Transformers version %s", transformers.__version__)


class TransformersHandler(BaseHandler, ABC):
    """Transformers handler class for sequence classification."""

    def __init__(self):
        super(TransformersHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        """In this initialize function, the BERT model is loaded.

        Args:
            ctx (context): It is a JSON Object containing information
            pertaining to the model artefacts parameters.
        """
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")

        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available() and properties.get("gpu_id") is not None
            else "cpu"
        )

        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.model.to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(
            "bert-base-uncased", do_lower_case=True
        )
        self.model.eval()

        logger.info("Transformer model from path %s loaded successfully", model_dir)

    def preprocess(self, requests):
        """Basic text preprocessing, based on the user's choice of application mode.

        Args:
            requests (str): The input data in the form of text is passed on to the
            preprocess function.

        Returns:
            list: The preprocess function returns a list of tensors for the size of
            the word tokens.
        """
        inputs = None
        for idx, data in enumerate(requests):
            input_text = data.get("data") or data.get("body")
            input_text = input_text["text"]
            inputs = self.tokenizer(input_text, return_tensors="pt")
        return inputs

    def inference(self, data, *args, **kwargs):
        """The inference function is used to make a prediction call on the given
        input request. The user needs to override the inference function to
        customize it.

        Args:
            data (Torch Tensor): A Torch Tensor is passed to make the inference
            request. The shape should match the model input shape.

        Returns:
            Torch Tensor: The predicted Torch Tensor is returned in this function.
        """
        mask = data['attention_mask'].to(self.device)
        input_id = data['input_ids'].squeeze(1).to(self.device)
        with torch.no_grad():
            results = self.model(input_id, mask)
        return results

    def postprocess(self, data):
        result = data.logits.argmax(dim=1)
        result = result.tolist()
        return [result]
Here is the client part:

import requests
import json

api = "http://127.0.0.1:8080/predictions/my_tc"
headers = {'Content-type': 'application/json', 'Accept': 'text/plain'}

payload = {"text": ["Bloomberg has decided to publish a new report on the global economy." for i in range(1000)]}
payload = json.dumps(payload)

response = requests.post(api, data=payload, headers=headers)
print(response.content.decode("UTF-8"))
@agunapal Thank you so much for the response. Your test shows that an input sample of 1000 works.
However, there are some differences between your test and my setup that I cannot fully account for:
- I am using SimpleTransformers, while the test case is based on AutoModelForSequenceClassification.
- AutoModelForSequenceClassification accepts model.eval(), while SimpleTransformers does not.
- I am passing the full text, while the test case passes the tokens.
Could you test any SimpleTransformers model on a text classifier? I am wondering whether GPU processing in SimpleTransformers is fully compatible with TorchServe, because:
- My sample code works well on CPU.
- The model is able to process 1000 texts in plain PyTorch, but not in TorchServe.
Maybe I need to switch to FastAPI and manual serving.
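One thing I still want to rule out (just a guess on my side, not something confirmed in this thread) is SimpleTransformers' own multiprocessing during predict(), which may not behave well inside a TorchServe worker process. A sketch of loading the model with multiprocessing disabled, assuming the installed simpletransformers version exposes these flags:

```python
from simpletransformers.classification import ClassificationArgs, ClassificationModel

# Hypothetical variant of load_model() with SimpleTransformers'
# multiprocessing turned off; "model_folder" is a placeholder path.
model_args = ClassificationArgs()
model_args.use_multiprocessing = False                 # feature conversion
model_args.use_multiprocessing_for_evaluation = False  # predict()/evaluation

model = ClassificationModel(
    "roberta",
    "model_folder",
    args=model_args,
    use_cuda=True,
)
```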
Hi @agunapal, I have tested another transformer model with the same custom handler, and there is no issue there. I think the issue is an incompatibility between SimpleTransformers and TorchServe. I am wondering, have you ever tested any SimpleTransformers model with TorchServe for text classification?
@tednaseri I am not sure if this has been tested. If you think SimpleTransformers adds value and want to create an example showing the integration, please feel free to open a PR and get feedback.
@tednaseri I am dealing with the same issue right now. Can you share what the solution was?