TensorRT-LLM
Model Performance Degraded when using BFLOAT16 LoRA Adapters
System Info
2X L4 GPUs
Docker Image: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3
Who can help?
@juney-nvidia @kaiyux
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I have a fine-tuned set of weights trained using Hugging Face:
- Llama3-8B
- LoRA
  - rank 32
  - rs_lora scaling
- RoPE scaling
  - linear
  - factor 1.75
  - rotary base / theta = 875000
I have prepared the Hugging Face safetensors weights using this process.
I have updated the base model's config.json:
"rope_scaling": {
"type": "linear",
"factor": 1.75
},
"rope_theta": 875000,
I then compiled the base model from within nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 by running:
```bash
python3 convert_checkpoint.py \
  --model_dir ${BASE_MODEL_DIR} \
  --output_dir /converted_base_model \
  --rotary_base 875000 \
  --dtype bfloat16 \
  --tp_size 2

trtllm-build \
  --checkpoint_dir /converted_base_model \
  --max_input_len=13568 \
  --max_num_tokens=14336 \
  --max_output_len=768 \
  --tp_size 2 \
  --max_batch_size 4 \
  --max_beam_width 3 \
  --lora_plugin bfloat16 \
  --gemm_plugin bfloat16 \
  --lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_gate mlp_4h_to_h \
  --max_lora_rank 32 \
  --gpt_attention_plugin bfloat16 \
  --paged_kv_cache enable \
  --use_paged_context_fmha enable \
  --multi_block_mode enable \
  --remove_input_padding enable \
  --use_custom_all_reduce disable \
  --cluster_key L4 \
  --workers=2 \
  --context_fmha enable \
  --lookup_plugin bfloat16 \
  --enable_xqa enable \
  --output_dir ${ENGINE_DIR}
```
I am then performing generations using triton-inference-server, using the warmups described above.
Generated outputs differ significantly from those produced by the same model in Hugging Face.
If the same process is repeated, but the model is first "merged and unloaded" before compilation and then served without LoRA weights, I get the exact same output from Triton/TensorRT-LLM as from Hugging Face.
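For reference, "merged and unloaded" above is presumably PEFT's merge_and_unload(); a minimal sketch of how that baseline can be produced (model and adapter paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16)
# Load the LoRA adapter and fold it into the base weights.
merged = PeftModel.from_pretrained(base, "/path/to/lora_adapter").merge_and_unload()
merged.save_pretrained("/merged_model")  # then converted and compiled as above
```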
Expected behavior
Outputs of the model served with LoRA weights should be the same as those of the merged-and-unloaded model. Both are also expected to nearly match the results when run in Hugging Face.
actual behavior
ROUGE-2 scores between Hugging Face outputs and the outputs of the model served with LoRA weights are below 0.6 (other metrics would also demonstrate the large shift in outputs that is occurring).
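The exact scoring procedure is not shown above; a minimal sketch of one way to compute it with the rouge_score package (file names are hypothetical, one generation per line, aligned by prompt):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

hf_outputs = open("hf_outputs.txt").read().splitlines()          # hypothetical
trt_outputs = open("trtllm_lora_outputs.txt").read().splitlines()  # hypothetical

scores = [scorer.score(ref, hyp)["rouge2"].fmeasure
          for ref, hyp in zip(hf_outputs, trt_outputs)]
print(sum(scores) / len(scores))  # mean ROUGE-2 F1, below 0.6 in this case
```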
additional notes
I noticed that the scale is applied to the "out" weights in the hf_lora_convert.py script. It appears that the "A" and "B" matrices (Hugging Face weights notation) correspond to "in" and "out" in TensorRT-LLM notation.
From equation 2 of the rsLoRA paper it seems that I should be able to get the same results from applying the scaling to either "A/in" or "B/out". In practice, applying the scaling to B gives results similar to the fine-tuning objective (but still significantly shifted), while applying the scaling only to "A/in" results in seemingly random token generation.
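With the rank-stabilized scaling the adapter delta is ΔW = (α/√r) · B · A, so folding the scale into either factor should be mathematically equivalent. A small sketch (random matrices standing in for the real adapter, and alpha is an assumed example value) showing that both placements agree with an fp32 reference up to bfloat16 rounding:

```python
import torch

torch.manual_seed(0)
r, d_in, d_out, alpha = 32, 4096, 4096, 64.0  # alpha is an assumed example value
scale = alpha / r ** 0.5                      # rank-stabilized (rsLoRA) scaling

A = torch.randn(r, d_in) * 0.02    # HF lora_A  ->  TensorRT-LLM "in"
B = torch.randn(d_out, r) * 0.02   # HF lora_B  ->  TensorRT-LLM "out"

ref = scale * (B @ A)                            # fp32 reference delta
fold_out = (scale * B).bfloat16().float() @ A    # scale folded into B / "out"
fold_in = B @ (scale * A).bfloat16().float()     # scale folded into A / "in"

print((fold_out - ref).abs().max())  # both placements match the reference
print((fold_in - ref).abs().max())   # up to bfloat16 rounding only
```

In isolation, then, where the scale is folded should not matter, which is why the large difference I see between the two placements at serving time is surprising.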