TensorRT-LLM
Model Performance Degraded when using BFLOAT16 LoRA Adapters
System Info
2X L4 GPUs
Docker Image: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3
Who can help?
@juney-nvidia @kaiyux
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I have a fine-tuned set of weights trained using Hugging Face:
- Llama3-8B
- LoRA
  - rank 32
  - rs_lora scaling
- RoPE scaling
  - linear
  - factor 1.75
  - rotary base / theta = 875000
I have prepared the Hugging Face safetensors weights using this process.
I have updated the base model's config.json:
"rope_scaling": {
"type": "linear",
"factor": 1.75
},
"rope_theta": 875000,
I then compiled the base model from within nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 by running:
```bash
python3 convert_checkpoint.py \
  --model_dir ${BASE_MODEL_DIR} \
  --output_dir /converted_base_model \
  --rotary_base 875000 \
  --dtype bfloat16 \
  --tp_size 2

trtllm-build \
  --checkpoint_dir /converted_base_model \
  --max_input_len=13568 \
  --max_num_tokens=14336 \
  --max_output_len=768 \
  --tp_size 2 \
  --max_batch_size 4 \
  --max_beam_width 3 \
  --lora_plugin bfloat16 \
  --gemm_plugin bfloat16 \
  --lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_gate mlp_4h_to_h \
  --max_lora_rank 32 \
  --gpt_attention_plugin bfloat16 \
  --paged_kv_cache enable \
  --use_paged_context_fmha enable \
  --multi_block_mode enable \
  --remove_input_padding enable \
  --use_custom_all_reduce disable \
  --cluster_key L4 \
  --workers=2 \
  --context_fmha enable \
  --lookup_plugin bfloat16 \
  --enable_xqa enable \
  --output_dir ${ENGINE_DIR}
```
I am then performing generations using triton-inference-server, using the warmups described above.
Generated outputs differ significantly from those produced by the same model in Hugging Face.
If the same process is repeated, but the model is first "merged and unloaded" before compilation and then served without LoRA weights, I get the exact same output from Triton/TensorRT-LLM as from Hugging Face.
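For reference, "merged and unloaded" above is presumably PEFT's merge_and_unload(); a minimal sketch of how that baseline can be produced (model and adapter paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16)
# Load the LoRA adapter and fold it into the base weights.
merged = PeftModel.from_pretrained(base, "/path/to/lora_adapter").merge_and_unload()
merged.save_pretrained("/merged_model")  # then converted and compiled as above
```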
Expected behavior
Outputs of the model served with LoRA weights should be the same as those of the merged-and-unloaded model. Both are also expected to nearly match the results when run in Hugging Face.
actual behavior
ROUGE-2 scores between Hugging Face outputs and the outputs of the model served with LoRA weights are below 0.6 (other metrics would also demonstrate the large shift in outputs that is occurring).
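The exact scoring procedure is not shown above; a minimal sketch of one way to compute it with the rouge_score package (file names are hypothetical, one generation per line, aligned by prompt):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

hf_outputs = open("hf_outputs.txt").read().splitlines()          # hypothetical
trt_outputs = open("trtllm_lora_outputs.txt").read().splitlines()  # hypothetical

scores = [scorer.score(ref, hyp)["rouge2"].fmeasure
          for ref, hyp in zip(hf_outputs, trt_outputs)]
print(sum(scores) / len(scores))  # mean ROUGE-2 F1, below 0.6 in this case
```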
additional notes
I noticed that the scale is applied to the "out" weights in the hf_lora_convert.py script. It appears that the "A" and "B" matrices (Hugging Face weights notation) correspond to "in" and "out" in TensorRT-LLM notation.
From equation 2 of the rsLoRA paper it seems that I should be able to get the same results from applying the scaling to either "A/in" or "B/out". In practice, applying the scaling to B gives results similar to the fine-tuning objective (but still significantly shifted), while applying the scaling only to "A/in" results in seemingly random token generation.
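With the rank-stabilized scaling the adapter delta is ΔW = (α/√r) · B · A, so folding the scale into either factor should be mathematically equivalent. A small sketch (random matrices standing in for the real adapter, and alpha is an assumed example value) showing that both placements agree with an fp32 reference up to bfloat16 rounding:

```python
import torch

torch.manual_seed(0)
r, d_in, d_out, alpha = 32, 4096, 4096, 64.0  # alpha is an assumed example value
scale = alpha / r ** 0.5                      # rank-stabilized (rsLoRA) scaling

A = torch.randn(r, d_in) * 0.02    # HF lora_A  ->  TensorRT-LLM "in"
B = torch.randn(d_out, r) * 0.02   # HF lora_B  ->  TensorRT-LLM "out"

ref = scale * (B @ A)                            # fp32 reference delta
fold_out = (scale * B).bfloat16().float() @ A    # scale folded into B / "out"
fold_in = B @ (scale * A).bfloat16().float()     # scale folded into A / "in"

print((fold_out - ref).abs().max())  # both placements match the reference
print((fold_in - ref).abs().max())   # up to bfloat16 rounding only
```

In isolation, then, where the scale is folded should not matter, which is why the large difference I see between the two placements at serving time is surprising.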