
Support multiple LoRA adapters at the same time (like S-LoRA)?

Open codernew007 opened this issue 1 year ago • 10 comments

It's a very important feature for a concurrent inference server.

codernew007 avatar Dec 25 '23 10:12 codernew007

I have the same issue.

Hap-Zhang avatar Dec 26 '23 08:12 Hap-Zhang

Could you share checkpoints to demonstrate?

This feature is supported in TRT-LLM, but it needs a few modifications to lora_manager to load several LoRA modules.

byshiue avatar Dec 26 '23 09:12 byshiue

Below is my code using huggingface/peft. All I need is for TRT-LLM to support running multiple LoRA adapters in a single batch or across concurrent requests, similar to the S-LoRA project. My model is 'Baichuan2-13B-Chat'.

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from peft import PeftConfig, PeftModel

model, tokenizer = (None, None)

def init_model(peft_model_id):
    global model, tokenizer
    if model is None:
        # Resolve the base model from the adapter's PEFT config.
        config = PeftConfig.from_pretrained(peft_model_id)
        tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, use_fast=False, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, device_map="auto", trust_remote_code=True)
        model.generation_config = GenerationConfig.from_pretrained(config.base_model_name_or_path)
        #model.generation_config.top_k = 50
        #model.generation_config.temperature = 1.01
        # Attach the LoRA adapter and merge it into the base weights.
        model = PeftModel.from_pretrained(model, peft_model_id, is_trainable=False)
        model = model.merge_and_unload()
        model.eval()
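
For context, the behavior I'd like TRT-LLM to match looks roughly like the multi-adapter pattern below with PEFT (just a sketch; the adapter paths and names are placeholders). PEFT only switches the active adapter per call, whereas S-LoRA serves different adapters inside the same batch, which is exactly why native support in TRT-LLM matters:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "baichuan-inc/Baichuan2-13B-Chat"
tokenizer = AutoTokenizer.from_pretrained(base_id, use_fast=False, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto", trust_remote_code=True)

# Attach several LoRA adapters to the same base weights (paths are placeholders).
model = PeftModel.from_pretrained(base, "lora_adapters/task_a", adapter_name="task_a")
model.load_adapter("lora_adapters/task_b", adapter_name="task_b")

# Per request, activate whichever adapter that request asks for.
model.set_adapter("task_a")  # or "task_b"
model.eval()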

codernew007 avatar Dec 26 '23 10:12 codernew007

We load LoRA models here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/lora_manager.py#L242-L251. You could change that code to load several LoRA models and assign them different ids, then pass the respective id during inference.

e.g.,

import torch

# Load each fine-tuned adapter and give it its own uid; uid "-1" with rank 0
# stands for requests that run the base model without any LoRA.
lora_model_1 = torch.load(f"{model_dir}/adapter_model_1.bin")
lora_model_2 = torch.load(f"{model_dir}/adapter_model_2.bin")
ranks = [0, lora_model_1.config["r"], lora_model_2.config["r"]]
uids = ["-1", "0", "1"]

and pass the desired LoRA id at inference time via --lora_task_uids.
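
To make the uid scheme concrete, here is an illustrative sketch (plain Python, not the exact TRT-LLM API) of how per-request LoRA selection maps onto those uids; with the example scripts this corresponds to passing something like --lora_task_uids 0 1 -1:

# Illustrative only: each request carries the uid of the adapter it wants;
# uid "-1" means "run the base model without any LoRA".
requests = [
    {"prompt": "Translate to French: good morning",   "lora_task_uid": "0"},
    {"prompt": "Summarize the following article ...", "lora_task_uid": "1"},
    {"prompt": "A plain base-model request",          "lora_task_uid": "-1"},
]
uids_for_batch = [r["lora_task_uid"] for r in requests]  # e.g. ["0", "1", "-1"]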

If you could share checkpoints with several lora models, we are happy to support that officially.

byshiue avatar Dec 27 '23 03:12 byshiue

@byshiue How is that supposed to be used? Is the adapter_model loaded on top of the main model built with TensorRT-LLM? Or is it possible to specify hf_lora_path when building with TensorRT-LLM, so that the engine can only be used with that one finished LoRA?

YooSungHyun avatar Jan 08 '24 10:01 YooSungHyun

@YooSungHyun I don't fully understand your question, but the LoRA checkpoint is loaded at runtime. During engine building, we only set up some parameters, such as which LoRA modules are needed.

byshiue avatar Jan 08 '24 14:01 byshiue

@byshiue Um... let me rephrase my question. I want to use TensorRT-LLM with a quantized base model and a LoRA in BF16. Which of these is the right way to do it?

case-1

  1. Build the TensorRT-LLM engine with my llama2-13B model quantized (BF16 llama2 to 8-bit).
  2. In Triton, just load the LoRA (BF16) on top of the 8-bit built model.

case-2

  1. Build the TensorRT-LLM engine with my llama2-13B model quantized (BF16 llama2 to 8-bit).
  2. Also quantize the LoRA itself, using some TensorRT-LLM feature I don't know about yet (BF16 LoRA to 8-bit).
  3. In Triton, load the 8-bit LoRA with the 8-bit built model.

case-3

  1. If TensorRT-LLM does not support loading and unloading LoRA at runtime in some way I'm not aware of, then I need to merge the LoRA into the model before building the engine.

Which case is right? I'm confused...

Also, I want to use HQQ (https://github.com/mobiusml/hqq); can such a model not be built?

YooSungHyun avatar Jan 09 '24 01:01 YooSungHyun

Case 1 might work, but we don't have code for that on the Triton side, and we haven't verified correctness with quantization.

For quantization, please refer to https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/README.md#tensorrt-llm-quantization-toolkit-installation-guide to quantize the model. TensorRT-LLM does not support third-party quantized models in most cases; any supported cases are listed in that document.

byshiue avatar Jan 11 '24 08:01 byshiue

Hi @byshiue, I'm also curious how we can serve multiple LoRA modules from a single server. Does the LoRA manager make any assumptions about the model architecture? https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/lora_manager.py#L242-L251.

Are there any plans to support this feature soon? If not, what would be the path to getting it supported soon? Thanks!

aravindMahadevan avatar Feb 02 '24 19:02 aravindMahadevan

@juney-nvidia or @byshiue, any updates on whether this feature will be supported soon?

aravindMahadevan avatar Feb 08 '24 15:02 aravindMahadevan

Here https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints is an example of multi-LoRA. Hope it is helpful.

byshiue avatar Feb 19 '24 04:02 byshiue

Hi, I am very interested in the multi-LoRA method! What should I do if I need the base model to be quantized? Weight-only quantize the base model while the LoRA checkpoints remain fp16/bf16? Thank you in advance!

Elenore1997 avatar Mar 07 '24 12:03 Elenore1997

TRT-LLM does not support quantization in the LoRA case for now.

byshiue avatar Mar 13 '24 06:03 byshiue

So in the multi-LoRA scenario, the base model and the LoRA adapters should both be fp16/bf16?

Elenore1997 avatar Mar 13 '24 06:03 Elenore1997

You are right.

byshiue avatar Mar 14 '24 09:03 byshiue

This feature is supported now. Closing this issue.

byshiue avatar Apr 09 '24 08:04 byshiue