Python Backend: one model instance over multiple GPUs
Problem
I am using the Python backend to deploy an LLM (14B parameters, ~28 GB) on 2 GPUs (16 GB each). The model is too large to fit on a single GPU, but Triton Inference Server kept creating 2 model instances (1 instance per GPU), which resulted in a CUDA out-of-memory error. The instance_group setting does not seem to help in this case.
Features
Is it possible to control the total number of model instances created? In my case, my machine can only support 1 model instance, but I cannot find a way to express this in model.py or config.pbtxt.
More details
Triton version: tritonserver:23.11-py3
transformers:
self.model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)
self.tokenizer = AutoTokenizer.from_pretrained(
    model_path, torch_dtype=torch.float16, padding_side='left'
)
Hello, what does your model configuration file config.pbtxt look like? Also, Triton is up to 24.03 right now. Is there a reason why you are not using the latest version?
The cloud platform I am using has not made the newest version of Triton available yet. The config.pbtxt looks like this:
name: "llm"
backend: "python"
max_batch_size: 4
input [
{
name: "prompt"
data_type: TYPE_STRING
dims: [1]
}
]
output [
{
name: "generated_text"
data_type: TYPE_STRING
dims: [1]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0, 1]
}
You can use KIND_MODEL and manually control the GPUs you want to use. One workaround would be to add an additional parameter (e.g. gpu_device_ids) in your model config that specifies the GPU IDs, and read them in your Python model.
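A minimal sketch of that workaround; the parameter name gpu_device_ids, the "0,1" value, and the "15GiB" cap below are illustrative choices, not built-in Triton options. The config.pbtxt declares a single KIND_MODEL instance plus the custom parameter:

instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
parameters [
  {
    key: "gpu_device_ids"
    value: { string_value: "0,1" }
  }
]

and model.py reads the parameter in initialize() and limits accelerate's sharding to those devices via max_memory:

import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the config.pbtxt serialized as JSON.
        model_config = json.loads(args["model_config"])
        params = model_config.get("parameters", {})

        # Hypothetical custom parameter; falls back to all visible GPUs.
        id_str = params.get("gpu_device_ids", {}).get("string_value", "")
        gpu_ids = [int(i) for i in id_str.split(",") if i.strip()]

        # With KIND_MODEL, Triton does not pin the instance to a GPU and
        # leaves device placement to the model. device_map="auto" then
        # shards the ~28 GB model across the allowed GPUs; "0GiB" excludes
        # a device, "15GiB" assumes the 16 GB cards from the question.
        max_memory = None
        if gpu_ids:
            max_memory = {
                i: ("15GiB" if i in gpu_ids else "0GiB")
                for i in range(torch.cuda.device_count())
            }

        model_path = "/path/to/model"  # illustrative; point at your weights
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto",
            max_memory=max_memory,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path, padding_side="left"
        )

With count: 1 and KIND_MODEL, Triton launches exactly one instance of the model regardless of how many GPUs are visible, which avoids the per-GPU duplication described above.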