The handler only gets the correct value from cuda:0.
🐛 Describe the bug
I am using 2 GPUs. TorchServe inference returns correct values only for predictions run on cuda:0.
Error logs
x = "text to embed"
url = f"http://localhost:9080/predictions/my-model"
x_emb_1 = requests.post(url, data = x).json()
x_emb_2 = requests.post(url, data = x).json()
x_emb_1 != x_emb_2 # True
Installation instructions
docker pull pytorch/torchserve:latest-gpu
Model Packaging
My handler looks like this:
import torch
from pathlib import Path

from ts.torch_handler.base_handler import BaseHandler
from model import MyModel


class MyModel_Handler(BaseHandler):
    def __init__(self):
        pass

    def initialize(self, context):
        # load the model
        self.manifest = context.manifest
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        # pick the GPU assigned to this worker (gpu_id comes from TorchServe's system properties)
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
        serialized_file = self.manifest["model"]["serializedFile"]
        model_pt_path = Path(model_dir) / serialized_file
        if not model_pt_path.exists():
            raise RuntimeError("Missing the model.pt file")
        hparams = {
            "load_ckpt": model_pt_path,
            "seq_len": 10,
            "window_size": 128,
        }
        self.model = MyModel(hparams).to(self.device)  # from transformers import AutoModel
        self.model = self.model.eval()
        self.initialized = True

    def preprocess(self, data):
        # decode the raw request bodies into strings
        inp = [d.get("body").decode("utf-8") for d in data]
        return inp

    def inference(self, data):
        with torch.no_grad():
            results = self.model(data)
        return results

    def postprocess(self, inference_output):
        return inference_output.tolist()
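If it helps to trace which worker answers which request, here is a minimal diagnostic sketch (not the handler I actually run) that replaces postprocess above so every response also carries the worker's device and process id; the extra fields are hypothetical and only for debugging:

import os

def postprocess(self, inference_output):
    # Diagnostic variant of postprocess: return each embedding together with
    # the worker's identity so alternating responses can be traced to a worker.
    device = str(self.device)    # e.g. "cuda:0" or "cuda:1"
    worker_pid = os.getpid()     # distinguishes the two workers
    return [
        {"embedding": emb, "device": device, "worker_pid": worker_pid}
        for emb in inference_output.tolist()
    ]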
config.properties
'batch_size': 512
'max_batch_delay': 100
'min_worker': 2
'max_worker': 2
Everything else uses the default settings.
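For completeness, here is roughly how those settings map onto the management API; this is only a sketch, assuming the default management port 8081 and a model archive named my-model.mar (both may differ from my actual setup):

import requests

management = "http://localhost:8081"

# Register the model with the batching settings listed above
# (my-model.mar is a placeholder for the real archive name).
requests.post(
    f"{management}/models",
    params={
        "url": "my-model.mar",
        "batch_size": 512,
        "max_batch_delay": 100,
        "initial_workers": 2,
        "synchronous": "true",
    },
)

# Pin the worker count to exactly two.
requests.put(
    f"{management}/models/my-model",
    params={"min_worker": 2, "max_worker": 2, "synchronous": "true"},
)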
Versions
docker-hub: pytorch/torchserve:0.6.0-gpu
Repro instructions
.
Possible Solution
.
Updated the issue.
I initially thought the handler returned the correct value only on cuda:0.
However, it turns out that alternating requests return the correct value, in a cycle of two, even though I run the TorchServe container with a single GPU.
Here's an example.
x = "test text"
url = "http://localhost:9080/predictions/my-model"
emb_x = requests.post(url, data=x).json()
emb_x_1 = requests.post(url, data=x).json()
emb_x == emb_x_1  # False
emb_x_2 = requests.post(url, data=x).json()
emb_x == emb_x_2  # True
emb_x_3 = requests.post(url, data=x).json()
emb_x == emb_x_3  # False
emb_x_4 = requests.post(url, data=x).json()
emb_x == emb_x_4  # True
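To make the two-group pattern explicit, a quick client-side check along these lines (same endpoint and payload as above) counts how many distinct responses come back over repeated identical requests:

import requests

url = "http://localhost:9080/predictions/my-model"
payload = "test text"

# Send the same request repeatedly and group identical responses;
# two workers disagreeing on the same input should yield exactly two groups.
groups = {}
for i in range(20):
    emb = requests.post(url, data=payload).json()
    groups.setdefault(repr(emb), []).append(i)

print(f"distinct responses: {len(groups)}")
for indices in groups.values():
    print(f"request indices: {indices}")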
@YongWookHa Could you please clarify whether your setup is single GPU or multi-GPU? Also, could you please share some details on the model (an example of an open-source model) so I can repro it?
Closing since there is no follow-up. Please re-open when you get a chance.