Model Performance Degraded When Using BFLOAT16 LoRA Adapters
System Info
2X L4 GPUs
Docker Image: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3
Who can help?
@juney-nvidia @kaiyux
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I have a fine-tuned set of weights trained using huggingface (a PEFT sketch of this configuration is shown below):

- Llama3-8B
- LoRA
  - Rank 32
  - rs_lora scaling
- Rope scaling
  - linear
  - factor 1.75
  - rotary base / theta = 875000
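For context, here is a minimal PEFT sketch of the adapter configuration described above. This is my own illustration rather than the actual training script; the values mirror the adapter_config.json shared further down in this thread, and model loading, data, and the Trainer are omitted.

```python
# A sketch (not the original training script) of a PEFT LoRA config matching the
# settings above; model loading, data, and the Trainer are intentionally omitted.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                 # rank 32
    lora_alpha=64,        # from the adapter_config.json shared later in this thread
    use_rslora=True,      # rsLoRA: effective scale is alpha / sqrt(r) instead of alpha / r
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```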
I have prepared the huggingface safetensor weights using this process.

I have updated the base model's `config.json`:
"rope_scaling": {
"type": "linear",
"factor": 1.75
},
"rope_theta": 875000,
I then compiled the base model from within nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 by running:
```bash
python3 convert_checkpoint.py \
    --model_dir ${BASE_MODEL_DIR} \
    --output_dir /converted_base_model \
    --rotary_base 875000 \
    --dtype bfloat16 \
    --tp_size 2

trtllm-build \
    --checkpoint_dir /converted_base_model \
    --max_input_len=13568 \
    --max_num_tokens=14336 \
    --max_output_len=768 \
    --tp_size 2 \
    --max_batch_size 4 \
    --max_beam_width 3 \
    --lora_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_gate mlp_4h_to_h \
    --max_lora_rank 32 \
    --gpt_attention_plugin bfloat16 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --multi_block_mode enable \
    --remove_input_padding enable \
    --use_custom_all_reduce disable \
    --cluster_key L4 \
    --workers=2 \
    --context_fmha enable \
    --lookup_plugin bfloat16 \
    --enable_xqa enable \
    --output_dir ${ENGINE_DIR}
```
I am then performing generations through triton-inference-server using the warmups described above.
Generated outputs differ significantly from those produced by the same model in huggingface.
If the same process is repeated but the model is first "merged and unloaded" before compilation and then served without LoRA weights, I get the exact same output from triton/tensorRT-LLM.
Expected behavior
Outputs of the model are the same with LoRA weights as they are with a merged and unloaded model. These results are also expected to nearly match the results when run in huggingface.
actual behavior
Rouge2 scores between huggingface outputs and the LoRA-weight-served model are below 0.6 (other metrics would also demonstrate the large shift in outputs that is occurring).
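For reference, a comparison of that kind can be reproduced along the lines of the sketch below. The output lists are hypothetical placeholders, and `rouge_score` is just one of several ways to compute ROUGE-2.

```python
# Sketch: ROUGE-2 between huggingface generations and Triton/TensorRT-LLM generations
# for the same prompts. The lists below are placeholders, not real outputs.
from rouge_score import rouge_scorer

hf_outputs = ["reference generation from huggingface ..."]
trt_outputs = ["generation from triton / tensorrt-llm ..."]

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
f_scores = [
    scorer.score(ref, hyp)["rouge2"].fmeasure
    for ref, hyp in zip(hf_outputs, trt_outputs)
]
print(sum(f_scores) / len(f_scores))  # falls below 0.6 with the LoRA-served engine
```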
additional notes
I noticed that the scale is being applied to the "out" weights in the hf_lora_convert.py script. It appears that the "A" and "B" matrices (huggingface weights notation) correspond to "in" and "out" in TensorRT-LLM notation.
From looking at the rank-stabilized LoRA (rsLoRA) paper [equation 2], it seems that I should be able to get the same results from applying the scaling to either "A/in" or "B/out". In practice, applying the scaling to B gives results similar to the fine-tuning objective (but still significantly shifted), while applying the scale to only "A/in" results in seemingly random token generation.
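For reference, here is a small numerical sketch (my own illustration, not the hf_lora_convert.py code) of why the two placements should be interchangeable in exact arithmetic, assuming rsLoRA's alpha / sqrt(r) scale:

```python
# Sketch: the rsLoRA scale (alpha / sqrt(r)) is a scalar, so folding it into
# A ("in") or B ("out") yields the same delta-W up to float32 rounding.
import numpy as np

r, alpha, d = 32, 64, 512
scale = alpha / np.sqrt(r)                     # rsLoRA; classic LoRA would use alpha / r

A = np.random.randn(r, d).astype(np.float32)   # huggingface lora_A -> TRT-LLM "in"
B = np.random.randn(d, r).astype(np.float32)   # huggingface lora_B -> TRT-LLM "out"

delta_scale_on_B = (scale * B) @ A
delta_scale_on_A = B @ (scale * A)
print(np.allclose(delta_scale_on_B, delta_scale_on_A, rtol=1e-4))  # True
```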
@kaiyux I'm seeing the same issue with llama-3-8b with rope scaling added. Can you help solve this problem?
Thanks for reporting this, our engineer will start looking into this issue soon.
Any updates?! I see a new issue that looks the same as well. In my case, I have now tried with the 24.07 tag and the results are the same.
Wondering if there is any progress?
In the bug description, I did not see which LoRA was used. Could you please tell me? It would be better to provide the huggingface link of the base model and LoRA model.
Thanks.
Triaged to @VincentJing. @TheCodeWrangler, please share the information asked by VincentJing above.
> In the bug description, I did not see which LoRA was used. Could you please tell me? It would be better to provide the huggingface link of the base model and LoRA model.
> Thanks.
I was using a fine-tuned rank 32 adapter trained with rs_lora.
Unfortunately I do not think I can share the actual weights.
Not sure it helps, but here is the adapter_config.json:
```json
{
"alpha_pattern": {},
"auto_mapping": null,
"base_model_name_or_path": "meta-llama/Meta-Llama-3-8B",
"bias": "none",
"fan_in_fan_out": false,
"inference_mode": true,
"init_lora_weights": true,
"layer_replication": null,
"layers_pattern": null,
"layers_to_transform": null,
"loftq_config": {},
"lora_alpha": 64,
"lora_dropout": 0.0,
"megatron_config": null,
"megatron_core": "megatron.core",
"modules_to_save": null,
"peft_type": "LORA",
"r": 32,
"rank_pattern": {},
"revision": null,
"target_modules": [
"k_proj",
"v_proj",
"down_proj",
"o_proj",
"up_proj",
"q_proj",
"gate_proj"
],
"task_type": "CAUSAL_LM",
"use_dora": false,
"use_rslora": true
}
```
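For what it's worth, the effective scale this config implies (assuming PEFT semantics: alpha / sqrt(r) when use_rslora is set, alpha / r otherwise) can be derived with a small sketch:

```python
# Sketch: derive the effective LoRA scaling factor from a PEFT adapter_config.json.
import json
import math

with open("adapter_config.json") as f:   # path is hypothetical
    cfg = json.load(f)

alpha, r = cfg["lora_alpha"], cfg["r"]
scale = alpha / math.sqrt(r) if cfg.get("use_rslora") else alpha / r
print(scale)   # 64 / sqrt(32) ≈ 11.31 here, vs. 64 / 32 = 2.0 for classic LoRA
```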
Any insights gained from knowing that alpha applied to A != alpha applied to B when scaling the weights?
Hi @TheCodeWrangler, have you solved this issue in the latest version? If not, could you please provide a script to reproduce it?
@TheCodeWrangler any updates on this?
@TheCodeWrangler any updates on this?
I actually was blocked on this for a deployment I needed, so I ended up changing base frameworks to vllm in order to move forward with deployments.
I last tested around 24.11 and was still seeing the same behavior at that point in time.
I think for reproducing the issue:
Get any weights which were trained, apply the alpha value to the A matrix, and then retry applying it to the B matrix (a numerical sketch of this comparison follows below).
I was finding that any test prompt would have its responses extremely altered if the scaling was done on the B matrix.
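To isolate just the numerics of the two placements (not the full serving path), a sketch like the following can be used, with random matrices standing in for trained adapter weights and the cast to bfloat16 standing in for what the engine sees:

```python
# Sketch: compare the two scale placements after casting to bfloat16 (random
# matrices stand in for trained adapter weights).
import torch

torch.manual_seed(0)
r, d, scale = 32, 4096, 64 / 32**0.5   # rank 32, alpha 64, rsLoRA scale

A = torch.randn(r, d)
B = torch.randn(d, r)

delta_scale_on_B = (scale * B).bfloat16().float() @ A.bfloat16().float()
delta_scale_on_A = B.bfloat16().float() @ (scale * A).bfloat16().float()

diff = (delta_scale_on_B - delta_scale_on_A).abs()
print(diff.max(), delta_scale_on_B.abs().max())  # difference stays small relative to the entries
```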
@TheCodeWrangler , thank you for sharing your observation! I'm closing this issue as stale for now, if you get a chance to try the latest release and the problem persists, please feel free to open a new one. Thank you!