TensorRT-LLM
How to load multiple LoRA weights and multiple text inputs for inference?
Currently, only a single set of LoRA weights and a single input are supported. How can multiple LoRA weights and multiple input token sequences be supported for batch inference? See https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/lora_manager.py#L234. When I run inference with two text inputs and the same LoRA weights for both, as follows:
mpirun -n 2 --allow-run-as-root python ../run.py --engine_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
--max_output_len 50 \
--temperature 1 \
--tokenizer_dir "/workspace/qllama-7b-chat" \
--input_text "你好" "你是谁?" \
--lora_dir "/workspace/offline" \
--lora_task_uids 0 0 \
--no_add_special_tokens
this error occurred:
File "/workspace/examples/llama/../run.py", line 285, in main
outputs = runner.generate(batch_input_ids,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 364, in generate
batch_input_ids, input_lengths = self._prepare_inputs(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 279, in _prepare_inputs
raise RuntimeError(
RuntimeError: Input batch size (2) exceeds the engine limit (1)
Traceback (most recent call last):
File "/workspace/examples/llama/../run.py", line 339, in <module>
main(args)
File "/workspace/examples/llama/../run.py", line 285, in main
outputs = runner.generate(batch_input_ids,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 364, in generate
batch_input_ids, input_lengths = self._prepare_inputs(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 279, in _prepare_inputs
raise RuntimeError(
RuntimeError: Input batch size (2) exceeds the engine limit (1)
Later, I set self.max_batch_size = 2, and this error occurred:
[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
Traceback (most recent call last):
File "/workspace/examples/llama/../run.py", line 339, in <module>
main(args)
File "/workspace/examples/llama/../run.py", line 285, in main
outputs = runner.generate(batch_input_ids,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 387, in generate
outputs = self.session.decode(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 639, in wrapper
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2208, in decode
return self.decode_regular(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1947, in decode_regular
should_stop, next_step_buffer, tasks, context_lengths, host_context_lengths, attention_mask, logits, encoder_input_lengths = self.handle_per_step(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1712, in handle_per_step
raise RuntimeError('Executing TRT engine failed!')
RuntimeError: Executing TRT engine failed!
Should I build different engines for different batch sizes?
Could you share the script used to build the engine?
python build.py --model_dir /workspace/qllama-7b-chat \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
--max_batch_size 1 \
--max_input_len 512 \
--max_output_len 50 \
--use_lora_plugin float16 \
--visualize \
--hf_lora_dir "/workspace/linker_g0TZGi36_best" \
--world_size 2 --tp_size 2
/workspace/linker_g0TZGi36_best is the LoRA weight obtained by fine-tuning.
If you want to run with batch size > 1, you need to set max_batch_size accordingly during engine building.
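For example, the engine above could be rebuilt for two requests per batch using the same command shared earlier in this thread, with only --max_batch_size raised from 1 to 2 (a sketch; all other flags and paths are taken unchanged from that command):
python build.py --model_dir /workspace/qllama-7b-chat \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
--max_batch_size 2 \
--max_input_len 512 \
--max_output_len 50 \
--use_lora_plugin float16 \
--visualize \
--hf_lora_dir "/workspace/linker_g0TZGi36_best" \
--world_size 2 --tp_size 2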
Thank you, but when is loading multiple LoRA weights expected to be supported?
Using multiple LoRA weights, both independently and merged, would be a very important feature.
The core feature is supported, but we don't have a checkpoint to demonstrate it. You could modify the lora_manager to load multiple LoRA weights.
I read the lora_manager.py source code. It looks like the LoraConfig class can only load one LoRA model and does not provide a way to load multiple LoRA models.
As I mentioned, users need to modify lora_manager.py to load several LoRA models.
Are there any examples I can refer to? Thank you very much!
No. We haven't found a suitable model to prepare the example. If you could share any checkpoints with several LoRA weights, we would be happy to prepare such an example.
@byshiue If we modify lora_manager to load multiple LoRA adapters, do they share the same base model? Or would the base model weights also be duplicated?
They share the same base model. We have an example here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints.
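For reference, a multi-adapter run.py invocation might look roughly like the sketch below. The two adapter directories are placeholders, the engine has to be built with max_batch_size >= 2 as discussed above, and the exact flag behavior (several directories after --lora_dir, one task uid per input) should be checked against the README example linked above:
mpirun -n 2 --allow-run-as-root python ../run.py --engine_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
--max_output_len 50 \
--tokenizer_dir "/workspace/qllama-7b-chat" \
--input_text "first prompt" "second prompt" \
--lora_dir "/workspace/lora_adapter_a" "/workspace/lora_adapter_b" \
--lora_task_uids 0 1 \
--no_add_special_tokens
The intent is that task uid 0 maps the first input to the first adapter and task uid 1 maps the second input to the second adapter.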
ok, thanks a lot!
Thanks @byshiue !!
In example 1, the build script only specifies one hf_lora_dir. Should this be two LoRA dirs? (To be consistent with the run script and the README description?)
python build.py --model_dir ${BASE_LLAMA_MODEL} \
....
--hf_lora_dir "Japanese-Alpaca-LoRA-7b-v0/" \
In engine building, we only use one of the LoRA models to get the common parameters, and users need to set max_lora_rank properly based on their LoRA models. Most of the LoRA logic is handled at runtime, which is why we only load multiple LoRA models at runtime.
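So a build for the multi-adapter case can still pass a single hf_lora_dir, roughly as sketched below. This is an assumption-laden sketch: ${ENGINE_DIR} is a placeholder output path, the adapter name comes from the quoted command above, and the --max_lora_rank option name is inferred from the comment above, so check build.py --help and the linked README for the exact flags. The rank should cover the largest rank among the adapters you plan to load at runtime:
python build.py --model_dir ${BASE_LLAMA_MODEL} \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--use_lora_plugin float16 \
--hf_lora_dir "Japanese-Alpaca-LoRA-7b-v0/" \
--max_lora_rank 8 \
--max_batch_size 2 \
--max_input_len 512 \
--max_output_len 50 \
--output_dir ${ENGINE_DIR}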
@byshiue How is dynamic LoRA switching implemented in TensorRT? Shouldn't it be static after being converted to an engine?
The LoRA weights are managed by the runtime instead of the engine. We pass the pointers to the LoRA weights as inputs to the TRT engine, so we can pass different pointers to switch the LoRA weights dynamically.