TensorRT-LLM
How to load multiple LoRA weights and multiple text inputs for inference?
Currently, only a single set of LoRA weights and a single input are supported. How can multiple LoRA weights and multiple input token sequences be supported for batch inference? See https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/lora_manager.py#L234. When I run inference with two text inputs and the same LoRA weights for both, as follows:
mpirun -n 2 --allow-run-as-root python ../run.py --engine_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
--max_output_len 50 \
--temperature 1 \
--tokenizer_dir "/workspace/qllama-7b-chat" \
--input_text "你好" "你是谁?" \
--lora_dir "/workspace/offline" \
--lora_task_uids 0 0 \
--no_add_special_tokens
this error occurred:
File "/workspace/examples/llama/../run.py", line 285, in main
outputs = runner.generate(batch_input_ids,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 364, in generate
batch_input_ids, input_lengths = self._prepare_inputs(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 279, in _prepare_inputs
raise RuntimeError(
RuntimeError: Input batch size (2) exceeds the engine limit (1)
Traceback (most recent call last):
File "/workspace/examples/llama/../run.py", line 339, in <module>
main(args)
File "/workspace/examples/llama/../run.py", line 285, in main
outputs = runner.generate(batch_input_ids,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 364, in generate
batch_input_ids, input_lengths = self._prepare_inputs(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 279, in _prepare_inputs
raise RuntimeError(
RuntimeError: Input batch size (2) exceeds the engine limit (1)
Later, I set self.max_batch_size = 2, and this error occurred:
[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
Traceback (most recent call last):
File "/workspace/examples/llama/../run.py", line 339, in <module>
main(args)
File "/workspace/examples/llama/../run.py", line 285, in main
outputs = runner.generate(batch_input_ids,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 387, in generate
outputs = self.session.decode(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 639, in wrapper
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2208, in decode
return self.decode_regular(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1947, in decode_regular
should_stop, next_step_buffer, tasks, context_lengths, host_context_lengths, attention_mask, logits, encoder_input_lengths = self.handle_per_step(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1712, in handle_per_step
raise RuntimeError('Executing TRT engine failed!')
RuntimeError: Executing TRT engine failed!
Should I build different engines for different batch sizes?
Could you share the script used to build the engine?
python build.py --model_dir /workspace/qllama-7b-chat \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
--max_batch_size 1 \
--max_input_len 512 \
--max_output_len 50 \
--use_lora_plugin float16 \
--visualize \
--hf_lora_dir "/workspace/linker_g0TZGi36_best" \
--world_size 2 --tp_size 2
/workspace/linker_g0TZGi36_best is the LoRA weight obtained by fine-tuning.
If you want to run with batch size > 1, you need to set max_batch_size accordingly during engine building.
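For example, the engine above could be rebuilt for two requests per batch using the same command shared earlier in this thread, with only --max_batch_size raised from 1 to 2 (a sketch; all other flags and paths are taken unchanged from that command):
python build.py --model_dir /workspace/qllama-7b-chat \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
--max_batch_size 2 \
--max_input_len 512 \
--max_output_len 50 \
--use_lora_plugin float16 \
--visualize \
--hf_lora_dir "/workspace/linker_g0TZGi36_best" \
--world_size 2 --tp_size 2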
Thank you, but when is loading multiple LoRA weights expected to be supported?
Using multiple LoRA weights, both independently and merged, would be a very important feature.
The core feature is supported, but we don't have a checkpoint to demonstrate it. You could modify the lora_manager to load multiple LoRA weights.
I read the lora_manager.py source code. It looks like the LoraConfig class can only load one LoRA model and does not provide a way to load multiple LoRA models.
As I mentioned, users need to modify lora_manager.py to load several LoRA models.
Are there any examples I can refer to? Thank you very much!
No. We haven't found a suitable model to prepare the example. If you could share any checkpoints with several LoRA weights, we would be happy to prepare such an example.
@byshiue If we modify lora_manager to load multiple LoRA adapters, do they share the same base model? Or would the base model weights also be duplicated?
They share the same base model. We have an example here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints.
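For reference, a multi-adapter run.py invocation might look roughly like the sketch below. The two adapter directories are placeholders, the engine has to be built with max_batch_size >= 2 as discussed above, and the exact flag behavior (several directories after --lora_dir, one task uid per input) should be checked against the README example linked above:
mpirun -n 2 --allow-run-as-root python ../run.py --engine_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
--max_output_len 50 \
--tokenizer_dir "/workspace/qllama-7b-chat" \
--input_text "first prompt" "second prompt" \
--lora_dir "/workspace/lora_adapter_a" "/workspace/lora_adapter_b" \
--lora_task_uids 0 1 \
--no_add_special_tokens
The intent is that task uid 0 maps the first input to the first adapter and task uid 1 maps the second input to the second adapter.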
ok, thanks a lot!
Thanks @byshiue !!
In example 1, the build script only specifies one hf_lora_dir. Should this be two LoRA dirs? (To be consistent with the run script and the README description?)
python build.py --model_dir ${BASE_LLAMA_MODEL} \
....
--hf_lora_dir "Japanese-Alpaca-LoRA-7b-v0/" \
In engine building, we only use one of the LoRA models to get the common parameters, and users need to set max_lora_rank properly based on their LoRA models. Most of the LoRA logic is handled at runtime, which is why we only load multiple LoRA models at runtime.
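So a build for the multi-adapter case can still pass a single hf_lora_dir, roughly as sketched below. This is an assumption-laden sketch: ${ENGINE_DIR} is a placeholder output path, the adapter name comes from the quoted command above, and the --max_lora_rank option name is inferred from the comment above, so check build.py --help and the linked README for the exact flags. The rank should cover the largest rank among the adapters you plan to load at runtime:
python build.py --model_dir ${BASE_LLAMA_MODEL} \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--use_gemm_plugin float16 \
--use_lora_plugin float16 \
--hf_lora_dir "Japanese-Alpaca-LoRA-7b-v0/" \
--max_lora_rank 8 \
--max_batch_size 2 \
--max_input_len 512 \
--max_output_len 50 \
--output_dir ${ENGINE_DIR}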
@byshiue How is dynamic LoRA switching implemented in TensorRT? Shouldn't it be static after being converted to an engine?
The LoRA weights are managed by the runtime instead of the engine. We pass the pointers to the LoRA weights as inputs to the TRT engine, so we can pass different pointers to switch the LoRA weights dynamically.