MLServer
Select GPU to be used for each worker on parallel inference
I am trying to run a transformer model with parallel inference across 4 workers on a machine that has 4 GPUs. All 4 workers load the model successfully, but they all end up using the same GPU. This is a snippet of the code used to load the model:
import torch
from mlserver import MLModel
from transformers import XLMRobertaForSequenceClassification

class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Every worker ends up picking "cuda:0" here
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        # dir_path points at the saved model weights (defined elsewhere)
        model = XLMRobertaForSequenceClassification.from_pretrained(dir_path)
        self.model = model.to(self.device)
        return True
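
To make the problem concrete, here is a rough sketch of the kind of per-worker device selection I'm after. It assumes each MLServer worker runs in its own process (so os.getpid() differs across workers); pick_device is a hypothetical helper, and the PID-modulo mapping is only an illustration of spreading workers across GPUs, not a real fix:

import os
import torch

def pick_device() -> str:
    # Hypothetical workaround: since each parallel worker is a separate
    # process, derive a device index from the worker's PID. The modulo
    # mapping is only illustrative; nothing guarantees an even or stable
    # spread across the 4 GPUs.
    if not torch.cuda.is_available():
        return "cpu"
    return f"cuda:{os.getpid() % torch.cuda.device_count()}"

Inside load() this would replace the hard-coded "cuda:0" with self.device = pick_device(), but ideally each worker would get an explicit index from MLServer itself rather than guessing from the PID.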
Basically, each worker would need to know which GPU it should use when loading the model, but I couldn't find a way to do that in the documentation or source code. Looking forward to your reply :)