The handler only gets the correct value from cuda:0.
🐛 Describe the bug
I am using 2 GPUs. TorchServe inference returns correct values only for predictions run on cuda:0.
Error logs
x = "text to embed"
url = f"http://localhost:9080/predictions/my-model"
x_emb_1 = requests.post(url, data = x).json()
x_emb_2 = requests.post(url, data = x).json()
x_emb_1 != x_emb_2 # True
Installation instructions
docker pull pytorch/torchserve:latest-gpu
Model Packaging
My handler looks like this:
import torch
from pathlib import Path

from ts.torch_handler.base_handler import BaseHandler
from model import MyModel


class MyModel_Handler(BaseHandler):
    def __init__(self):
        pass

    def initialize(self, context):
        # load the model
        self.manifest = context.manifest
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        # pick the GPU assigned to this worker (gpu_id comes from TorchServe's system properties)
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
        serialized_file = self.manifest["model"]["serializedFile"]
        model_pt_path = Path(model_dir) / serialized_file
        if not model_pt_path.exists():
            raise RuntimeError("Missing the model.pt file")
        hparams = {
            "load_ckpt": model_pt_path,
            "seq_len": 10,
            "window_size": 128,
        }
        self.model = MyModel(hparams).to(self.device)  # from transformers import AutoModel
        self.model = self.model.eval()
        self.initialized = True

    def preprocess(self, data):
        # decode the raw request bodies into strings
        inp = [d.get("body").decode("utf-8") for d in data]
        return inp

    def inference(self, data):
        with torch.no_grad():
            results = self.model(data)
        return results

    def postprocess(self, inference_output):
        return inference_output.tolist()
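If it helps to trace which worker answers which request, here is a minimal diagnostic sketch (not the handler I actually run) that replaces postprocess above so every response also carries the worker's device and process id; the extra fields are hypothetical and only for debugging:

import os

def postprocess(self, inference_output):
    # Diagnostic variant of postprocess: return each embedding together with
    # the worker's identity so alternating responses can be traced to a worker.
    device = str(self.device)    # e.g. "cuda:0" or "cuda:1"
    worker_pid = os.getpid()     # distinguishes the two workers
    return [
        {"embedding": emb, "device": device, "worker_pid": worker_pid}
        for emb in inference_output.tolist()
    ]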
config.properties
'batch_size': 512
'max_batch_delay': 100
'min_worker': 2
'max_worker': 2
Everything else uses the default settings.
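For completeness, here is roughly how those settings map onto the management API; this is only a sketch, assuming the default management port 8081 and a model archive named my-model.mar (both may differ from my actual setup):

import requests

management = "http://localhost:8081"

# Register the model with the batching settings listed above
# (my-model.mar is a placeholder for the real archive name).
requests.post(
    f"{management}/models",
    params={
        "url": "my-model.mar",
        "batch_size": 512,
        "max_batch_delay": 100,
        "initial_workers": 2,
        "synchronous": "true",
    },
)

# Pin the worker count to exactly two.
requests.put(
    f"{management}/models/my-model",
    params={"min_worker": 2, "max_worker": 2, "synchronous": "true"},
)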
Versions
docker-hub: pytorch/torchserve:0.6.0-gpu
Repro instructions
.
Possible Solution
.
Updated the issue.
I initially thought the handler returned the correct value only on cuda:0.
However, it turns out that alternating requests return the correct value, in a cycle of two, even though I run the TorchServe container with a single GPU.
Here's an example.
x = "test text"
url = "http://localhost:9080/predictions/my-model"
emb_x = requests.post(url, data=x).json()
emb_x_1 = requests.post(url, data=x).json()
emb_x == emb_x_1  # False
emb_x_2 = requests.post(url, data=x).json()
emb_x == emb_x_2  # True
emb_x_3 = requests.post(url, data=x).json()
emb_x == emb_x_3  # False
emb_x_4 = requests.post(url, data=x).json()
emb_x == emb_x_4  # True
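To make the two-group pattern explicit, a quick client-side check along these lines (same endpoint and payload as above) counts how many distinct responses come back over repeated identical requests:

import requests

url = "http://localhost:9080/predictions/my-model"
payload = "test text"

# Send the same request repeatedly and group identical responses;
# two workers disagreeing on the same input should yield exactly two groups.
groups = {}
for i in range(20):
    emb = requests.post(url, data=payload).json()
    groups.setdefault(repr(emb), []).append(i)

print(f"distinct responses: {len(groups)}")
for indices in groups.values():
    print(f"request indices: {indices}")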
@YongWookHa Could you please clarify whether your setup is single GPU or multi-GPU? Also, could you please share some details on the model (an example of an open-source model) so I can repro it?
Closing since there is no follow-up. Please re-open when you get a chance.