Python Backend: one model instance over multiple GPUs
Problem
I am using the Python backend to deploy an LLM (14B parameters, ~28 GB) on 2 GPUs (16 GB each). The model is too large to fit on a single GPU, but Triton Inference Server kept creating 2 model instances (1 instance per GPU), which resulted in a CUDA out-of-memory error. The instance_group setting does not seem to help in this case.
Features
Is it possible to control the total number of model instances created? In my case, my machine can only support 1 model instance, but I cannot find a way to express this in model.py or config.pbtxt.
More details
Triton version: tritonserver:23.11-py3
transformers:
self.model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)
self.tokenizer = AutoTokenizer.from_pretrained(
    model_path, torch_dtype=torch.float16, padding_side='left'
)
Hello, what does your model configuration file config.pbtxt look like? Also, Triton is up to 24.03 right now. Is there a reason why you are not using the latest version?
The cloud platform I am using has not made the newest version of Triton available yet. The config.pbtxt looks like this:
name: "llm"
backend: "python"
max_batch_size: 4
input [
{
name: "prompt"
data_type: TYPE_STRING
dims: [1]
}
]
output [
{
name: "generated_text"
data_type: TYPE_STRING
dims: [1]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0, 1]
}
You can use KIND_MODEL and manually control the GPUs you want to use. One workaround would be to add an additional parameter (e.g. gpu_device_ids) in your model config that specifies the GPU IDs, and read them in your Python model.
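A minimal sketch of that workaround; the parameter name gpu_device_ids, the "0,1" value, and the "15GiB" cap below are illustrative choices, not built-in Triton options. The config.pbtxt declares a single KIND_MODEL instance plus the custom parameter:

instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
parameters [
  {
    key: "gpu_device_ids"
    value: { string_value: "0,1" }
  }
]

and model.py reads the parameter in initialize() and limits accelerate's sharding to those devices via max_memory:

import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the config.pbtxt serialized as JSON.
        model_config = json.loads(args["model_config"])
        params = model_config.get("parameters", {})

        # Hypothetical custom parameter; falls back to all visible GPUs.
        id_str = params.get("gpu_device_ids", {}).get("string_value", "")
        gpu_ids = [int(i) for i in id_str.split(",") if i.strip()]

        # With KIND_MODEL, Triton does not pin the instance to a GPU and
        # leaves device placement to the model. device_map="auto" then
        # shards the ~28 GB model across the allowed GPUs; "0GiB" excludes
        # a device, "15GiB" assumes the 16 GB cards from the question.
        max_memory = None
        if gpu_ids:
            max_memory = {
                i: ("15GiB" if i in gpu_ids else "0GiB")
                for i in range(torch.cuda.device_count())
            }

        model_path = "/path/to/model"  # illustrative; point at your weights
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto",
            max_memory=max_memory,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path, padding_side="left"
        )

With count: 1 and KIND_MODEL, Triton launches exactly one instance of the model regardless of how many GPUs are visible, which avoids the per-GPU duplication described above.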