
How to load multiple LoRA weights and multiple text inputs for inference?

Open jkl375 opened this issue 1 year ago • 17 comments

How to load multiple LoRA weights and multiple text inputs for inference? Currently, only a single LoRA weight and a single input sequence are supported as inputs. How can multiple LoRA weights and multiple input token sequences be supported for batch inference? https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/lora_manager.py#L234 When I run inference with two text inputs and the same LoRA weights twice, as follows:

mpirun -n 2 --allow-run-as-root python ../run.py --engine_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
              --max_output_len 50 \
              --temperature 1 \
              --tokenizer_dir "/workspace/qllama-7b-chat" \
              --input_text "你好" "你是谁?" \
              --lora_dir "/workspace/offline" \
              --lora_task_uids 0 0 \
              --no_add_special_tokens

the error occurred:

File "/workspace/examples/llama/../run.py", line 285, in main
    outputs = runner.generate(batch_input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 364, in generate
    batch_input_ids, input_lengths = self._prepare_inputs(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 279, in _prepare_inputs
    raise RuntimeError(
RuntimeError: Input batch size (2) exceeds the engine limit (1)
Traceback (most recent call last):
  File "/workspace/examples/llama/../run.py", line 339, in <module>
    main(args)
  File "/workspace/examples/llama/../run.py", line 285, in main
    outputs = runner.generate(batch_input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 364, in generate
    batch_input_ids, input_lengths = self._prepare_inputs(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 279, in _prepare_inputs
    raise RuntimeError(
RuntimeError: Input batch size (2) exceeds the engine limit (1)

Later, I set self.max_batch_size = 2, and this error occurred:

[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
Traceback (most recent call last):
  File "/workspace/examples/llama/../run.py", line 339, in <module>
    main(args)
  File "/workspace/examples/llama/../run.py", line 285, in main
    outputs = runner.generate(batch_input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 387, in generate
    outputs = self.session.decode(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 639, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2208, in decode
    return self.decode_regular(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1947, in decode_regular
    should_stop, next_step_buffer, tasks, context_lengths, host_context_lengths, attention_mask, logits, encoder_input_lengths = self.handle_per_step(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1712, in handle_per_step
    raise RuntimeError('Executing TRT engine failed!')
RuntimeError: Executing TRT engine failed!

Should I build different engines for different batch sizes?

jkl375 avatar Dec 07 '23 01:12 jkl375

Could you share the script you used to build the engine?

byshiue avatar Dec 07 '23 09:12 byshiue

python build.py --model_dir /workspace/qllama-7b-chat \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
                --max_batch_size 1 \
                --max_input_len 512 \
                --max_output_len 50 \
                --use_lora_plugin float16 \
                --visualize  \
                --hf_lora_dir "/workspace/linker_g0TZGi36_best" \
                --world_size 2 --tp_size 2

/workspace/linker_g0TZGi36_best contains the LoRA weights produced by fine-tuning.

jkl375 avatar Dec 08 '23 05:12 jkl375

If you want to run with batch size > 1, you need to set --max_batch_size accordingly when building the engine.
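For example, rebuilding with the batch limit raised to 2 (all other flags copied from the build command shared above; adjust paths to your setup) should allow the two-input run:

python build.py --model_dir /workspace/qllama-7b-chat \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
                --max_batch_size 2 \
                --max_input_len 512 \
                --max_output_len 50 \
                --use_lora_plugin float16 \
                --visualize  \
                --hf_lora_dir "/workspace/linker_g0TZGi36_best" \
                --world_size 2 --tp_size 2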

byshiue avatar Dec 08 '23 09:12 byshiue

Thank you. But when is loading multiple LoRA weights expected to be supported?

jkl375 avatar Dec 08 '23 09:12 jkl375

Supporting multiple LoRA weights, both independently and merged, would be a very important feature.

codybum avatar Dec 09 '23 14:12 codybum

The core feature is supported, but we don't have a checkpoint to demonstrate it. You could modify the lora_manager to load multiple LoRA weights.

byshiue avatar Dec 11 '23 06:12 byshiue

The core feature is supported, but we don't have a checkpoint to demonstrate it. You could modify the lora_manager to load multiple LoRA weights.

I read the lora_manager.py source code and found that the LoraConfig class can only load one LoRA model; it does not show how to load multiple LoRA models.

WangxuP avatar Jan 04 '24 09:01 WangxuP

The core feature is supported, but we don't have a checkpoint to demonstrate it. You could modify the lora_manager to load multiple LoRA weights.

I read the lora_manager.py source code and found that the LoraConfig class can only load one LoRA model; it does not show how to load multiple LoRA models.

As I mentioned, users need to modify lora_manager.py to load several LoRA models.

byshiue avatar Jan 05 '24 09:01 byshiue

The core feature is supported, but we don't have a checkpoint to demonstrate it. You could modify the lora_manager to load multiple LoRA weights.

I read the lora_manager.py source code and found that the LoraConfig class can only load one LoRA model; it does not show how to load multiple LoRA models.

As I mentioned, users need to modify lora_manager.py to load several LoRA models.

Are there any examples I can refer to? Thank you very much!

WangxuP avatar Jan 05 '24 13:01 WangxuP

No. We haven't found a suitable model to prepare the example with. If you could share any checkpoints with several LoRA weights, we would be happy to prepare such an example.

byshiue avatar Jan 08 '24 03:01 byshiue

The core feature is supported, but we don't have a checkpoint to demonstrate it. You could modify the lora_manager to load multiple LoRA weights.

I read the lora_manager.py source code and found that the LoraConfig class can only load one LoRA model; it does not show how to load multiple LoRA models.

As I mentioned, users need to modify lora_manager.py to load several LoRA models.

@byshiue If we modify lora_manager to load multiple LoRA adapters, do they share the same base model? Or would the base model weights also be duplicated?

hchoi-moveworks avatar Jan 23 '24 23:01 hchoi-moveworks

They share the same base model. We have an example here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints.
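The pattern in that example is roughly the following (the engine path, adapter directories, and prompts below are placeholders, not the exact README command): pass several adapter directories to --lora_dir, and use --lora_task_uids to pick an adapter index per input, where -1 selects the base model with no adapter.

python ../run.py --engine_dir ${ENGINE_DIR} \
                 --tokenizer_dir ${BASE_LLAMA_MODEL} \
                 --max_output_len 50 \
                 --input_text "prompt for adapter 0" "prompt for adapter 1" "prompt for the base model" \
                 --lora_dir "lora-checkpoint-0/" "lora-checkpoint-1/" \
                 --lora_task_uids 0 1 -1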

byshiue avatar Jan 30 '24 08:01 byshiue

They share the same base model. We have an example here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints.

ok, thanks a lot!

WangxuP avatar Jan 30 '24 10:01 WangxuP

They share the same base model. We have an example here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints.

Thanks @byshiue !!

In example 1, the build script only specifies one hf_lora_dir. Should this be two LoRA dirs, to be consistent with the run script and the README description?

python build.py --model_dir ${BASE_LLAMA_MODEL} \
              ....
                --hf_lora_dir "Japanese-Alpaca-LoRA-7b-v0/" \

hchoi-moveworks avatar Jan 31 '24 07:01 hchoi-moveworks

They share the same base model. We have an example here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints.

Thanks @byshiue !!

In example 1, the build script only specifies one hf_lora_dir. Should this be two LoRA dirs, to be consistent with the run script and the README description?

python build.py --model_dir ${BASE_LLAMA_MODEL} \
              ....
                --hf_lora_dir "Japanese-Alpaca-LoRA-7b-v0/" \

At engine-build time we only use one of the LoRA models to derive the common parameters, and users need to set max_lora_rank properly based on their LoRA models. Most of the LoRA logic is handled at runtime, which is why multiple LoRA models are loaded only at runtime.
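So a multi-adapter build still passes a single --hf_lora_dir; what matters is that the rank cap covers the largest adapter. Assuming your version of build.py exposes a --max_lora_rank argument (check build.py --help), that looks roughly like:

python build.py --model_dir ${BASE_LLAMA_MODEL} \
              ....
                --use_lora_plugin float16 \
                --hf_lora_dir "Japanese-Alpaca-LoRA-7b-v0/" \
                --max_lora_rank 64

The value 64 is only illustrative; set it to at least the largest rank among the adapters you plan to load at runtime.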

byshiue avatar Feb 19 '24 07:02 byshiue

@byshiue How is dynamic LoRA switching implemented in TensorRT? Shouldn't everything be static after the model is converted to an engine?

Baboom-l avatar May 20 '24 04:05 Baboom-l

The LoRA weights are managed by the runtime instead of the engine. We pass pointers to the LoRA weights as inputs to the TRT engine, so we can pass different pointers to switch the LoRA weights dynamically.
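Conceptually (this is an illustrative sketch, not TensorRT-LLM's actual runtime code), it works like keeping every adapter resident in memory and feeding the engine a small table of raw weight addresses, one row per request:

# Conceptual sketch only -- not TensorRT-LLM's real implementation.
# Every LoRA adapter stays resident in memory; the compiled engine never
# changes. Switching adapters per request just means writing different
# weight addresses into a small pointer table that is fed to the engine.
import torch

# Hypothetical store: task uid -> (A, B) low-rank matrices.
# In the real runtime these would live on the GPU in fp16.
lora_store = {
    0: (torch.randn(4096, 16), torch.randn(16, 4096)),
    1: (torch.randn(4096, 8), torch.randn(8, 4096)),
}

def pointer_table(task_uids):
    """Build the per-batch pointer table handed to the engine."""
    rows = []
    for uid in task_uids:
        a, b = lora_store[uid]
        rows.append([a.data_ptr(), b.data_ptr()])  # raw weight addresses
    return torch.tensor(rows, dtype=torch.int64)   # passed as an engine input

# Two requests in one batch, each selecting a different adapter.
print(pointer_table([0, 1]))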

byshiue avatar May 23 '24 07:05 byshiue