TensorRT-LLM
Support multiple LoRA adapters at the same time (like S-LoRA)?
It's a very important feature for a concurrent inference server.
Same issue here.
Could you share checkpoints to demonstrate?
This feature is supported in TRT-LLM, but it needs a few modifications to lora_manager so that it can load several LoRA modules.
Below is my code using huggingface/peft. All I need is for TRT-LLM to support running multiple LoRA adapters in a single batch or across concurrent requests, in a similar fashion to the S-LoRA project. My model is 'Baichuan2-13B-Chat'.
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model, tokenizer = None, None

def init_model(peft_model_id):
    global model, tokenizer
    if model is None:
        config = PeftConfig.from_pretrained(peft_model_id)
        tokenizer = AutoTokenizer.from_pretrained(
            config.base_model_name_or_path, use_fast=False, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(
            config.base_model_name_or_path, device_map="auto", trust_remote_code=True)
        model.generation_config = GenerationConfig.from_pretrained(config.base_model_name_or_path)
        # model.generation_config.top_k = 50
        # model.generation_config.temperature = 1.01
        # Attach the LoRA adapter, then merge it into the base weights
        model = PeftModel.from_pretrained(model, peft_model_id, is_trainable=False)
        model = model.merge_and_unload()
        model.eval()
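For context, here is a minimal usage sketch for the merged model above (the adapter path, prompt, and sampling settings are illustrative). Note that merge_and_unload() bakes exactly one adapter into the base weights, which is why per-request multi-LoRA support in TRT-LLM is needed for concurrent serving.

```python
import torch

# Minimal sketch: generate with the single merged adapter from init_model() above.
init_model("my-baichuan2-lora-adapter")  # hypothetical local PEFT adapter path

inputs = tokenizer("Write a short poem about the sea.", return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```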
We load LoRA models here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/lora_manager.py#L242-L251. You could change that code to load several LoRA models and assign them different IDs, then pass the respective ID during inference. For example:
lora_model_1 = torch.load(f"{model_dir}/adapter_model_1.bin")
lora_model_2 = torch.load(f"{model_dir}/adapter_model_2.bin")
ranks = [0, lora_model_1.config["r"], lora_model_2.config["r"]]  # rank 0 = base model, no LoRA
uids = ["-1", "0", "1"]  # uid "-1" = base model
and pass a different LoRA ID per request via --lora_task_uids.
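For reference, a slightly more complete sketch of that loading loop, assuming the usual PEFT checkpoint layout (adapter_model.bin plus adapter_config.json, with the rank stored under "r"); the directory names and uid scheme below are illustrative:

```python
import json
import torch

# Hypothetical sketch: load several PEFT adapters and assign each one a uid.
adapter_dirs = ["./lora_adapter_0/", "./lora_adapter_1/"]  # assumed local adapter directories
uids = ["-1"]   # uid "-1" is reserved for the base model (no LoRA)
ranks = [0]     # rank 0 means no LoRA is applied

lora_weights = {}
for i, adapter_dir in enumerate(adapter_dirs):
    with open(f"{adapter_dir}/adapter_config.json") as f:
        ranks.append(json.load(f)["r"])  # LoRA rank "r" from the adapter config
    lora_weights[str(i)] = torch.load(f"{adapter_dir}/adapter_model.bin", map_location="cpu")
    uids.append(str(i))
```

Each request can then select one of these uids (or "-1" for the plain base model) via --lora_task_uids.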
If you could share checkpoints with several LoRA models, we would be happy to support that officially.
@byshiue How do I use that? Do I import the adapter_model into the main model built with TensorRT-LLM? Or is it possible to specify hf_lora_path when building with TensorRT-LLM, so that the built engine is used only with that finished LoRA?
@YooSungHyun I don't really get your question, but the LoRA checkpoint is loaded at runtime. During engine building, we only set up some parameters, such as which LoRA modules we need.
@byshiue Um... let me rephrase my question. I want to use TensorRT-LLM with quantization plus a LoRA (BF16). Which of these cases is the way to do it?
Case 1:
- Build with TensorRT-LLM and quantize my Llama2-13B model (BF16 Llama2 to 8-bit).
- In Triton, just load the LoRA (BF16) with the 8-bit built model.
Case 2:
- Build with TensorRT-LLM and quantize my Llama2-13B model (BF16 Llama2 to 8-bit).
- Also quantize only the LoRA, via some TensorRT-LLM feature I don't know about yet (BF16 LoRA to 8-bit).
- In Triton, just load the 8-bit LoRA with the 8-bit built model.
Case 3:
- TensorRT-LLM does not support LoRA load/unload in any way I'm aware of, so I need to bake the LoRA into the build itself.
Which case is right? I'm confused...
Also, I want to use hqq (https://github.com/mobiusml/hqq); is it impossible to build that model?
Case 1 might work, but we don't have code on the Triton side and we haven't verified correctness with quantization.
For quantization, please refer to https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/README.md#tensorrt-llm-quantization-toolkit-installation-guide to quantize the model. TensorRT-LLM does not support third-party quantized models in most cases; if some cases are supported, they will be listed in the documentation.
Hi @byshiue, I'm also curious how we can serve multiple LoRA modules from a single server. Does the LoRA manager make any assumptions about the model architecture? https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/lora_manager.py#L242-L251
Are there any plans to support this feature soon? If not, what would be the path to getting it supported? Thanks!
@juney-nvidia or @byshiue, any updates on whether this feature will be supported soon?
Here is an example of multi-LoRA: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints. Hoping it is helpful.
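To connect the linked example with the Python runtime, here is a hypothetical sketch of batched inference where each request selects its own adapter. The lora_dir and lora_uids parameter names are assumptions based on how run.py exposes --lora_dir and --lora_task_uids; please check the linked README and run.py for the exact invocation.

```python
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Assumed paths: an engine built with the LoRA plugin enabled, plus two HF LoRA checkpoints.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
runner = ModelRunner.from_dir(
    engine_dir="./trt_engines/llama-13b-lora/",
    lora_dir=["./lora_adapter_0/", "./lora_adapter_1/"],
)

prompts = [
    "Translate to French: Good morning.",                        # served by adapter 0
    "Summarize: TensorRT-LLM is a library for LLM inference.",   # served by adapter 1
    "Tell me a joke.",                                           # served by the plain base model
]
batch_input_ids = [torch.tensor(tokenizer.encode(p)) for p in prompts]

outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=64,
    lora_uids=["0", "1", "-1"],  # one uid per request; "-1" disables LoRA
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)
```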
Hi, I am very interested in the multi-LoRA method! What should I do if I need the base model to be quantized? Weight-only quantize the base model while the LoRA checkpoints remain FP16/BF16? Thanks in advance!
TRT-LLM does not support quantization in the LoRA case for now.
So in the multi-LoRA scenario, the base model and the LoRA adapter should both be FP16/BF16?
You are right.
This feature is supported now. Closing this issue.