MLServer

Select GPU to be used for each worker on parallel inference

Open · teddy-ambona opened this issue on Feb 15, 2024 · 0 comments

I am trying to run a transformer model with parallel inference across 4 workers on a machine that has 4 GPUs. All 4 workers load the model successfully, but they all end up using the same GPU. Here is a snippet of the code used to load the model:

import torch
from mlserver import MLModel
from transformers import XLMRobertaForSequenceClassification

class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Every worker evaluates this to the same device,
        # so all four of them end up on cuda:0.
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"

        # dir_path points to the saved model (defined elsewhere)
        model = XLMRobertaForSequenceClassification.from_pretrained(dir_path)
        self.model = model.to(self.device)

        return True

Basically, each worker would need to know which GPU it should use when loading the model, but I couldn't find a way to do that in either the documentation or the source code. Looking forward to your reply :)
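For reference, one workaround I have been considering is sketched below. It is not based on any documented MLServer API; it just assumes that each worker runs in its own OS process and derives a device index from the process ID, so different workers tend to land on different GPUs:

```python
import os

import torch
from mlserver import MLModel
from transformers import XLMRobertaForSequenceClassification

class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        if torch.cuda.is_available():
            # Each worker is a separate process, so its PID can be
            # used to spread workers across the visible GPUs. This is
            # only a heuristic: PIDs are often consecutive, but an
            # even spread across devices is not guaranteed.
            gpu_index = os.getpid() % torch.cuda.device_count()
            self.device = f"cuda:{gpu_index}"
        else:
            self.device = "cpu"

        # dir_path is assumed to be defined elsewhere, as in the
        # snippet above.
        model = XLMRobertaForSequenceClassification.from_pretrained(dir_path)
        self.model = model.to(self.device)

        return True
```

An official mechanism for passing a per-worker GPU assignment (e.g. via an environment variable set per worker) would be much cleaner than relying on PIDs.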
