
Support multiple LoRA adapters at the same time (like S-LoRA)?

Open codernew007 opened this issue 1 year ago • 10 comments

It's a very important feature for a concurrent inference server.

codernew007 avatar Dec 25 '23 10:12 codernew007

I have the same issue.

Hap-Zhang avatar Dec 26 '23 08:12 Hap-Zhang

Could you share checkpoints to demonstrate?

This feature is supported in TRT-LLM, but it needs a few modifications to lora_manager to load several LoRA modules.

byshiue avatar Dec 26 '23 09:12 byshiue

Below is my code using huggingface/peft. All I need is for TRT-LLM to support running multiple LoRA adapters in a single batch or across concurrent requests, similar to the S-LoRA project. My model is 'Baichuan2-13B-Chat'.

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from peft import PeftConfig, PeftModel

model, tokenizer = (None, None)

def init_model(peft_model_id):
    global model, tokenizer
    if model is None:
        # Resolve the base model from the adapter's PEFT config.
        config = PeftConfig.from_pretrained(peft_model_id)
        tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, use_fast=False, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, device_map="auto", trust_remote_code=True)
        model.generation_config = GenerationConfig.from_pretrained(config.base_model_name_or_path)
        #model.generation_config.top_k = 50
        #model.generation_config.temperature = 1.01
        # Attach the LoRA adapter and merge it into the base weights.
        model = PeftModel.from_pretrained(model, peft_model_id, is_trainable=False)
        model = model.merge_and_unload()
        model.eval()
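
For context, the behavior I'd like TRT-LLM to match looks roughly like the multi-adapter pattern below with PEFT (just a sketch; the adapter paths and names are placeholders). PEFT only switches the active adapter per call, whereas S-LoRA serves different adapters inside the same batch, which is exactly why native support in TRT-LLM matters:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "baichuan-inc/Baichuan2-13B-Chat"
tokenizer = AutoTokenizer.from_pretrained(base_id, use_fast=False, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto", trust_remote_code=True)

# Attach several LoRA adapters to the same base weights (paths are placeholders).
model = PeftModel.from_pretrained(base, "lora_adapters/task_a", adapter_name="task_a")
model.load_adapter("lora_adapters/task_b", adapter_name="task_b")

# Per request, activate whichever adapter that request asks for.
model.set_adapter("task_a")  # or "task_b"
model.eval()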

codernew007 avatar Dec 26 '23 10:12 codernew007

We load LoRA models here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/lora_manager.py#L242-L251. You could change that code to load several LoRA models and assign them different ids, then pass the respective id during inference.

e.g.,

import torch

# Load each fine-tuned adapter and give it its own uid; uid "-1" with rank 0
# stands for requests that run the base model without any LoRA.
lora_model_1 = torch.load(f"{model_dir}/adapter_model_1.bin")
lora_model_2 = torch.load(f"{model_dir}/adapter_model_2.bin")
ranks = [0, lora_model_1.config["r"], lora_model_2.config["r"]]
uids = ["-1", "0", "1"]

and pass the desired LoRA id at inference time via --lora_task_uids.
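
To make the uid scheme concrete, here is an illustrative sketch (plain Python, not the exact TRT-LLM API) of how per-request LoRA selection maps onto those uids; with the example scripts this corresponds to passing something like --lora_task_uids 0 1 -1:

# Illustrative only: each request carries the uid of the adapter it wants;
# uid "-1" means "run the base model without any LoRA".
requests = [
    {"prompt": "Translate to French: good morning",   "lora_task_uid": "0"},
    {"prompt": "Summarize the following article ...", "lora_task_uid": "1"},
    {"prompt": "A plain base-model request",          "lora_task_uid": "-1"},
]
uids_for_batch = [r["lora_task_uid"] for r in requests]  # e.g. ["0", "1", "-1"]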

If you could share checkpoints with several lora models, we are happy to support that officially.

byshiue avatar Dec 27 '23 03:12 byshiue

@byshiue How is that supposed to be used? Is the adapter_model loaded on top of the main model built with TensorRT-LLM? Or is it possible to specify hf_lora_path when building with TensorRT-LLM, so that the engine can only be used with that one finished LoRA?

YooSungHyun avatar Jan 08 '24 10:01 YooSungHyun

@YooSungHyun I don't fully understand your question, but the LoRA checkpoint is loaded at runtime. During engine building, we only set up some parameters, such as which LoRA modules are needed.

byshiue avatar Jan 08 '24 14:01 byshiue

@byshiue Um... let me rephrase my question. I want to use TensorRT-LLM with a quantized base model and a LoRA in BF16. Which of these is the right way to do it?

case-1

  1. Build the TensorRT-LLM engine with my llama2-13B model quantized (BF16 llama2 to 8-bit).
  2. In Triton, just load the LoRA (BF16) on top of the 8-bit built model.

case-2

  1. Build the TensorRT-LLM engine with my llama2-13B model quantized (BF16 llama2 to 8-bit).
  2. Also quantize the LoRA itself, using some TensorRT-LLM feature I don't know about yet (BF16 LoRA to 8-bit).
  3. In Triton, load the 8-bit LoRA with the 8-bit built model.

case-3

  1. If TensorRT-LLM does not support loading and unloading LoRA at runtime in some way I'm not aware of, then I need to merge the LoRA into the model before building the engine.

Which case is right? I'm confused...

Also, I want to use HQQ (https://github.com/mobiusml/hqq); can such a model not be built?

YooSungHyun avatar Jan 09 '24 01:01 YooSungHyun

Case 1 might work, but we don't have code for that on the Triton side, and we haven't verified correctness with quantization.

For quantization, please refer to https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/README.md#tensorrt-llm-quantization-toolkit-installation-guide to quantize the model. TensorRT-LLM does not support third-party quantized models in most cases; any supported cases are listed in that document.

byshiue avatar Jan 11 '24 08:01 byshiue

Hi @byshiue, I'm also curious how we can serve multiple LoRA modules from a single server. Does the LoRA manager make any assumptions about the model architecture? https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/lora_manager.py#L242-L251.

Are there any plans to support this feature soon? If not, what would be the path to getting it supported soon? Thanks!

aravindMahadevan avatar Feb 02 '24 19:02 aravindMahadevan

@juney-nvidia or @byshiue, any updates on whether this feature will be supported soon?

aravindMahadevan avatar Feb 08 '24 15:02 aravindMahadevan

Here https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints is an example of multi-LoRA. Hope it is helpful.

byshiue avatar Feb 19 '24 04:02 byshiue

Hi, I am very interested in the multi-LoRA method! What should I do if I need the base model to be quantized? Weight-only quantize the base model while the LoRA checkpoints remain fp16/bf16? Thank you in advance!

Elenore1997 avatar Mar 07 '24 12:03 Elenore1997

TRT-LLM does not support quantization in the LoRA case for now.

byshiue avatar Mar 13 '24 06:03 byshiue

So in the multi-LoRA scenario, the base model and the LoRA adapters should both be fp16/bf16?

Elenore1997 avatar Mar 13 '24 06:03 Elenore1997

You are right.

byshiue avatar Mar 14 '24 09:03 byshiue

This feature is supported now. Closing this issue.

byshiue avatar Apr 09 '24 08:04 byshiue