[Bug]: Capture CudaGraph with LoRA
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
When I use LoRA with enforce_eager=False (which means CUDA graphs should be captured), I find that the code below can cause a problem (in vllm/vllm/worker/model_runner.py):
```python
if self.lora_config:
    lora_mapping = LoRAMapping(
        **dict(index_mapping=[0] * batch_size,
               prompt_mapping=[0] * batch_size,
               is_prefill=False))
    self.set_active_loras(set(), lora_mapping)
```
Then I printed token_lora_indices via self.lora_manager._adapter_manager.punica_wrapper._token_lora_indices, but only got tensor([-1, -1, -1, ..., 0, 0, 0], device='cuda:0'). A token with a LoRA index of -1 does not seem right.
I don't think LoRA should be captured in the CUDA graph, especially in the case where you might want to switch between multiple different LoRAs.
What is the behaviour that you observed with enforce_eager=True?
-1 means no LoRA is applied.
I defined a model myself and called bgmv in it to do some LoRA calculations, so indices=-1 resulted in a CUDA error.
> I don't think LoRA should be captured in the CUDA graph, especially in the case where you might want to switch between multiple different LoRAs. What is the behaviour that you observed with enforce_eager=True?

If I run it with enforce_eager=True, the whole model will not be captured by the capture function, so indices=-1 will not appear.

> -1 means no LoRA is applied.

Seems I should modify my implementation for situations like indices=-1.
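
For reference, a minimal sketch of one way to make a custom LoRA computation tolerate indices of -1. This is plain PyTorch, not vLLM's punica/bgmv kernel, and the tensor shapes are assumptions:

```python
import torch

def lora_delta_per_token(
    x: torch.Tensor,                   # [num_tokens, hidden]
    lora_a: torch.Tensor,              # [num_loras, hidden, rank]
    lora_b: torch.Tensor,              # [num_loras, rank, out]
    token_lora_indices: torch.Tensor,  # [num_tokens], -1 = no LoRA
) -> torch.Tensor:
    """Hypothetical helper: tokens whose LoRA index is -1 get a zero delta."""
    # Clamp -1 to 0 so the gather is always valid, then mask the result.
    # This avoids data-dependent control flow, which is friendlier to
    # CUDA graph capture than branching on the indices.
    safe_idx = token_lora_indices.clamp(min=0)
    mask = (token_lora_indices >= 0).unsqueeze(-1).to(x.dtype)
    xa = torch.bmm(x.unsqueeze(1), lora_a[safe_idx]).squeeze(1)      # [num_tokens, rank]
    delta = torch.bmm(xa.unsqueeze(1), lora_b[safe_idx]).squeeze(1)  # [num_tokens, out]
    return delta * mask
```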
> the whole model will not be captured by the capture function

Well, yes, that's the behaviour of enforce_eager=True...
Correct me if I'm wrong, but you shouldn't capture the CUDA graph for LoRA?
Yes, indeed I don't want LoRA to be captured. I think my error was caused by my misuse of the bgmv kernel.
It is supported in https://github.com/vllm-project/vllm/pull/14626
Can you try again with 0.8.0?
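
For anyone landing here later, a minimal sketch of a retest on 0.8.0; the model name and adapter path are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enforce_eager defaults to False, so CUDA graphs are captured.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora"),
)
print(outputs[0].outputs[0].text)
```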
I'm currently using version 0.7.2. I think I'll try CUDA graphs for LoRA with version 0.8 in the future.
Also, I'd like to ask a question: enabling CUDA graphs for LoRA doesn't seem to require too many code changes? This has been very convenient for me while learning your code 😊
The 0.7.2 version should still be the V0 version of LoRA. For V0, vLLM only captures CUDA graphs during the decode stage, and LoRA does support CUDA graphs there, which you can confirm through torch.profiler.
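
A minimal sketch of that torch.profiler check (the model name is a placeholder); if decode steps are replayed from a captured graph, cudaGraphLaunch entries typically show up in the trace instead of individual kernel launches:

```python
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
) as prof:
    llm.generate(["Hello"], SamplingParams(max_tokens=16))

# Look for cudaGraphLaunch in the summary to confirm graph replay.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```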
> The 0.7.2 version should still be the V0 version of LoRA. For V0, vLLM only captures CUDA graphs during the decode stage, and LoRA does support CUDA graphs there, which you can confirm through torch.profiler.
Thanks for your reply! I'm curious about the difference between LoRA V0 and V1. I read the description in https://github.com/vllm-project/vllm/pull/13096, but I'm still confused by the sentence "V1 doesn't group requests based on LoRA ID. The new set of kernels have information about which input tokens map to which LoRA ID and they use this information to load the appropriate input tokens." This looks like the bgmv implementation?
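
One way to read that sentence, sketched in plain PyTorch rather than vLLM's actual Triton kernels: the earlier approach groups tokens by LoRA ID and runs one matmul per group, whereas kernels that carry a per-token LoRA index (as bgmv-style kernels do) gather the right adapter weights directly, as in the masked per-token sketch earlier in this thread. A grouped version might look like:

```python
import torch

def lora_delta_grouped(
    x: torch.Tensor,                   # [num_tokens, hidden]
    lora_a: torch.Tensor,              # [num_loras, hidden, rank]
    lora_b: torch.Tensor,              # [num_loras, rank, out]
    token_lora_indices: torch.Tensor,  # [num_tokens], -1 = no LoRA
) -> torch.Tensor:
    """Illustrative grouping by LoRA ID: one matmul per distinct adapter."""
    delta = x.new_zeros(x.shape[0], lora_b.shape[-1])
    for lora_id in token_lora_indices.unique().tolist():
        if lora_id < 0:
            continue  # tokens without a LoRA keep a zero delta
        rows = token_lora_indices == lora_id
        delta[rows] = x[rows] @ lora_a[lora_id] @ lora_b[lora_id]
    return delta
```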
From Slack, it seems to be supported for both V0 and V1 once you upgrade to the latest vLLM.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!