
[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none.

rtx-8000 opened this issue on Mar 07 '25

Your current environment

The output of `python collect_env.py` (not provided)

How would you like to use vllm

I've noticed that using LoRA with rank=256 significantly slows down inference by 4x, as shown below. However, reducing the rank to 8 or 16 brings performance closer to that of no LoRA. I'm currently using two fully-utilized GPUs, without the enforce_eager flag, and have set the maximum LoRA rank accordingly. Interestingly, adjusting the maximum model length had no impact on performance. What steps can I take to optimize performance?
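
For reference, my setup looks roughly like the following minimal sketch (paths are placeholders; this assumes vLLM's offline LLM + LoRARequest API rather than my exact script):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder paths -- the real model is llama-3.3-70b-instruct-awq plus a rank-256 adapter.
BASE_MODEL = "/path/to/llama-3.3-70b-instruct-awq"
ADAPTER = "/path/to/lora-r256-adapter"

llm = LLM(
    model=BASE_MODEL,
    quantization="awq",          # AWQ-quantized base weights
    tensor_parallel_size=2,      # two GPUs
    enable_lora=True,
    max_lora_rank=256,           # must be >= the adapter's rank
    gpu_memory_utilization=0.6,
    max_model_len=1024,
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(
    ["Example prompt"],
    sampling,
    lora_request=LoRARequest("sft_adapter", 1, ADAPTER),
)
print(outputs[0].outputs[0].text)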

No Lora

Processed prompts: 0%|▏ | 5/2430 [01:28<6:58:39, 10.36s/it, est. speed input: 3.71 toks/s, output: 2.34 toks/s]
Processed prompts: 10%|█████▊ | 240/2430 [05:09<44:09, 1.21s/it, est. speed input: 87.79 toks/s, output: 90.18 toks/s]
WARNING 03-06 17:12:30 scheduler.py:1754] Sequence group 352 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 20%|███████████▏ | 476/2430 [09:38<39:30, 1.21s/it, est. speed input: 106.63 toks/s, output: 117.32 toks/s]

Lora rank = 16

Processed prompts: 0%| | 0/2430 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 03-07 11:35:15 scheduler.py:1754] Sequence group 238 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 0%| | 3/2430 [01:24<13:43:22, 20.36s/it, est. speed input: 2.31 toks/s, output: 1.25 toks/s]
WARNING 03-07 11:36:05 scheduler.py:1754] Sequence group 187 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 11%|██████▎ | 262/2430 [06:11<42:31, 1.18s/it, est. speed input: 84.40 toks/s, output: 88.40 toks/s]
WARNING 03-07 11:40:46 scheduler.py:1754] Sequence group 342 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=101
Processed prompts: 18%|██████████▍ | 437/2430 [10:07<43:53, 1.32s/it, est. speed input: 96.26 toks/s, output: 105.08 toks/s]
WARNING 03-07 11:44:38 scheduler.py:1754] Sequence group 569 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=151

Lora rank = 256

Processed prompts: 0%| | 0/2430 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 03-06 17:25:54 scheduler.py:1754] Sequence group 255 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 0%| | 4/2430 [02:52<20:13:48, 30.02s/it, est. speed input: 1.50 toks/s, output: 0.86 toks/s]
Processed prompts: 10%|█████▊ | 246/2430 [10:13<1:19:59, 2.20s/it, est. speed input: 45.74 toks/s, output: 46.86 toks/s]
WARNING 03-06 17:34:07 scheduler.py:1754] Sequence group 356 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 20%|███████████▌ | 476/2430 [18:01<47:13, 1.45s/it, est. speed input: 57.00 toks/s, output: 61.91 toks/s]

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

rtx-8000 avatar Mar 07 '25 11:03 rtx-8000

We have a benchmark result at slack_lora_thread. We are aware of this issue and will be optimizing the LoRA performance. Could you please provide your model and LoRA config?

jeejeelee avatar Mar 07 '25 14:03 jeejeelee

Model: llama-3.3-70b-instruct-awq
LoRA config:

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "/u01/app/mlo/models/Llama-3.3-70B-Instruct",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 512,
  "lora_dropout": 0.0,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 256,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "gate_proj",
    "v_proj",
    "down_proj",
    "q_proj",
    "k_proj",
    "o_proj",
    "up_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
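
For a rough sense of the extra work rank adds: each targeted weight of shape (out_features, in_features) gains r * (in_features + out_features) LoRA parameters per layer. A back-of-the-envelope sketch (assuming typical Llama-3.x-70B shapes: hidden 8192, intermediate 28672, 64 query heads and 8 KV heads of dim 128, 80 layers; exact values may differ):

# Back-of-the-envelope size of the adapter described by the config above.
hidden, inter, n_layers = 8192, 28672, 80
kv_dim = 8 * 128  # grouped-query attention: 8 KV heads of head_dim 128

# (in_features, out_features) for each targeted module
modules = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj": (hidden, inter),
    "down_proj": (inter, hidden),
}

for r in (16, 256):
    n_params = n_layers * sum(r * (i + o) for i, o in modules.values())
    print(f"rank {r:3d}: ~{n_params / 1e9:.2f}B extra LoRA parameters")

# Prints roughly 0.21B for rank 16 and 3.31B for rank 256 -- the rank-256 adapter
# does about 16x more LoRA matmul work per token than the rank-16 one.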

rtx-8000 avatar Mar 07 '25 14:03 rtx-8000

Hello, I ran more tests with max_num_seqs set to 1. I am now getting worse results for the rank-16 adapter than for the rank-256 one. They both have the same configuration, i.e. the same target_modules.

rtx-8000 avatar Mar 11 '25 16:03 rtx-8000

Hi @rtx-8000, if you can use the nightly, can you try setting the environment variable VLLM_USE_V1=1? For example,

 VLLM_USE_V1="1" vllm serve  meta-llama/Llama-2-7b-hf --enable-lora --max-loras 4 --max-lora-rank 256 --lora-modules "lora0"="yard1/llama-2-7b-sql-lora-test" "lora1"="yard1/llama-2-7b-sql-lora-test" "lora2"="yard1/llama-2-7b-sql-lora-test" "lora3"="yard1/llama-2-7b-sql-lora-test"

I see that for lower ranks, VLLM_USE_V1="0" is slightly better. But VLLM_USE_V1="1" doesn't seem to be affected by the max-lora-rank as much. https://github.com/vllm-project/vllm/pull/14626 should make the low rank case better 🤞 cc @jeejeelee
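
Once the server is up, each adapter registered with --lora-modules can be selected by its name through the OpenAI-compatible API, e.g. (a minimal sketch, assuming the default port 8000):

# Query one of the LoRA adapters served above through the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="lora0",  # the name registered via --lora-modules
    prompt="Write a SQL query that lists all users:",
    max_tokens=64,
)
print(resp.choices[0].text)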

Thank you for your comment. When running the following command:

VLLM_USE_V1="1" python scripts/vllm_infer.py --model_name_or_path /u01/data/analytics/models/llama-3.3-70b-instruct-awq/ --adapter_name_or_path saves/llama3.3-70b/fsdp_qlora_aug_tag_r256/sft/ --dataset sft_dataset_aug_tag --vllm_config "{gpu_memory_utilization: 0.6, max_model_len: 700, max_seq_len_to_capture: 700, max_lora_rank:256, max_num_seqs: 1}"

I got the following errors:

(VllmWorker rank=0 pid=345921) ERROR 03-12 08:59:00 utils.py:608] Cannot use FA version 2 is not supported due to FA3 is only supported on devices with compute capability >= 8 excluding 8.6 and 8.9

(VllmWorker rank=0 pid=345921) ERROR 03-12 08:59:39 multiproc_executor.py:374] ValueError: Unsupported FA version: None

FYI, vllm_infer.py.
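
Since the Quadro RTX 8000 is a Turing card (compute capability 7.5) and the log above indicates neither FA2 nor FA3 can be used on it, one workaround sketch is simply to stay on the V0 engine on this hardware, i.e. the same effect as prefixing the command with VLLM_USE_V1=0:

# Minimal workaround sketch: force the V0 engine on GPUs where the V1
# FlashAttention backend is unavailable. Equivalent to VLLM_USE_V1=0 on the CLI.
import os
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM  # import vLLM only after the environment variable is set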

rtx-8000 avatar Mar 12 '25 08:03 rtx-8000

I will test #14626 asap and will provide the test results here. @rtx-8000 @varun-sundar-rabindranath

jeejeelee avatar Mar 12 '25 08:03 jeejeelee

Thank you @jeejeelee. I also noticed something odd: I am testing on two datasets, one of which has longer sequences. When I run the model with LoRA on the shorter-sequence dataset, it is roughly 2-3x slower.

rtx-8000 avatar Mar 12 '25 12:03 rtx-8000

Hey @rtx-8000, can you post your output of `python collect_env.py` please?

[Two image attachments, not transcribed.]

jeejeelee avatar Mar 14 '25 07:03 jeejeelee

So what could be wrong on my side that gives slower performance for the lower rank, especially at max_num_seqs (request rate) = 1, and no such big difference at a higher rate?

rtx-8000 avatar Mar 17 '25 09:03 rtx-8000

So what could be wrong on my side that gives slower performance for the lower rank, especially at max_num_seqs (request rate) = 1, and no such big difference at a higher rate?

Could you please provide your running script?

jeejeelee avatar Mar 18 '25 02:03 jeejeelee

python scripts/vllm_infer.py --model_name_or_path /u01/data/analytics/models/llama-3.3-70b-instruct-awq/ --adapter_name_or_path saves/llama3.3-70b/fsdp_qlora_aug_tag_r256/sft/ --dataset sft_dataset_aug_tag --vllm_config "{gpu_memory_utilization: 0.6, max_model_len: 1024, max_lora_rank: 256, max_num_seqs: 1}"

and this is the vllm_infer.py

and these are the GPUs I have:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 8000               Off  | 00000000:37:00.0 Off |                    0 |
| 45%   68C    P2            152W / 260W  | 24223MiB / 46080MiB  |     62%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000               Off  | 00000000:D8:00.0 Off |                    0 |
| 74%   86C    P2            246W / 260W  | 28247MiB / 46080MiB  |     86%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

and vllm version 0.7.3

rtx-8000 avatar Mar 18 '25 16:03 rtx-8000

So what could be wrong on my side that gives slower performance for the lower rank, especially at max_num_seqs (request rate) = 1, and no such big difference at a higher rate?

@rtx-8000 as you may already know, max_num_seqs is the maximum number of sequences that the scheduler can schedule in one iteration. With max_num_seqs set to 1, vLLM has very little scope for parallelization. For V0, the default max_num_seqs in main is 256.

Given a large enough set of inputs, performance should increase as you raise max_num_seqs, up to a point. Do you not see this in your experiments?
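
For example, a quick sweep like the rough sketch below (placeholder model and prompts) should show output throughput climbing as max_num_seqs grows, until the GPUs saturate:

# Rough sketch: compare generation throughput at different max_num_seqs values.
# Placeholder model/prompts; in practice each setting is best run in a fresh process.
import time
from vllm import LLM, SamplingParams

prompts = ["Summarize: vLLM is a fast LLM inference engine."] * 200
sampling = SamplingParams(max_tokens=64)

for max_num_seqs in (1, 16, 64, 256):
    llm = LLM(model="facebook/opt-125m", max_num_seqs=max_num_seqs)
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start
    n_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"max_num_seqs={max_num_seqs:3d}: {n_tokens / elapsed:.1f} output toks/s")
    del llm  # release the engine before the next setting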

also,

I've noticed that using LoRA with rank=256 significantly slows down inference by 4x, as shown below. However, reducing the rank to 8 or 16 brings performance closer to that of no LoRA.

Can you verify whether you are still seeing this on main, or on the latest nightly? Thanks 🙌

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jun 23 '25 02:06 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Jul 23 '25 02:07 github-actions[bot]